Apparently, nobody writes tokenizers like this

Apparently, nobody writes tokenizers like this

Call me crazy, but I've never really seen the point of so-called "parser generators" and "lexical analyzer generators" in real life. Almost any file has syntax that's so simple, it's easy to just parse it yourself. And languages that are more complicated or have tight parser performance requirements - like C++ compilers or Ruby interpreters - tend to have hand-rolled parsers because the automatic parser generators can't do it.

So who benefits from automatic parser generators? I don't know. I feel like I'm missing something.

This feeling came up again the other day when I found I had to parse and produce some XML files at work - in Delphi. Having seen lots of advice in lots of places that "the first thing a new programmer does with XML is to try, and fail, to write his own XML parser," I was hesitant. Okay, I thought. Why not look into one of those well-known, fancy pants XML parsers that will surely solve my problem in two seconds flat?

Well, I looked into them. Much more than two seconds later, I emerged, horrified. How can you guys possibly make parsing a text file so complicated? Why, after adding your tool, does my project now seem more difficult than it did when I started?

I still don't know. Look guys, I really tried. But I just don't understand why I'd want to use a DTD. Or twelve layers of abstraction. Or the "structured" way you completely reject (with confusing error messages) almost-but-not-quite valid XML files, in clear violation of Jon Postel's robustness principle.

So I broke down and wrote an XML parser myself. In Delphi. In about 500 lines. In an afternoon. I'm sure I left out major portions of the XML spec, but you know what? It parses the file the customer sent me, and the Big Fancy Professional XML Library didn't, because it said the file was invalid.

I guess that makes me a clueless newbie.

But back to tokenizers

As an odd coincidence, someone I know was doing some (much less redundant) work on parsing a different file format at around the same time. As anyone who has done parsers should know, most parsers are divided into two main parts: lexical analysis (which I'll call "tokenizing") and parsing.

I agree with this distinction. Unfortunately, that seems to be where my formal education ends, because I just can't figure out why lexical analysis is supposed to be so difficult. Almost all the lexical analyzers I've seen have been state machines driven by a single main loop, with a whole bunch of if statements and/or a switch/case statement and/or function pointers and/or giant object inheritance hierarchies.

Sure enough, the person I was talking to was writing just such a tokenizer in python - with lambdas and all the rest.

The problem is I just don't understand why all that stuff should be necessary. Traditional lexical analysis seems to be based on the theory that you need to have a single outer main loop, or you'll be inefficient / redundant / impure. But what I think is that loop constructs are generally only a single line of code; it doesn't cost you anything to put loops in twelve different places. So that's what I did.

I suppose that makes me a newbie. But it works. And my code is more readable than his. In fact, when I showed him a copy, he was amazed at how simple it is. He actually called it brilliant. Seriously.

To be honest, I still feel like I must be missing something. And yet here we are.

...so without further ado, my XML tokenizer, in 62 lines of Pascal. For your convenience, I have highlighted the blasphemous parts.

function iswhite(c: char): boolean;
begin
   result := c in [' ', #10, #13, #9];
end;

function important(c: char): boolean;
begin
   result := c in ['<', '>', '=', '/', '?', '"', ''''];
end;

function next_token(s: string; var i: integer): TPwToken;
var
   start, max: integer;
begin
   start := i;
   max := length(s)+1;
   result.val := '';

   if i >= max then begin
       result.ttype := ttEof;
   end else if important(s[i]) then begin
       if s[i] = '<' then begin
           result.ttype := ttLt;
           inc(i);
       end else if s[i] = '>' then begin
           result.ttype := ttGt;
           inc(i);
       end else if s[i] = '=' then begin
           result.ttype := ttEquals;
           inc(i);
       end else if s[i] = '/' then begin
           result.ttype := ttSlash;
           inc(i);
       end else if s[i] = '?' then begin
           result.ttype := ttQuestionMark;
           inc(i);
       end else if s[i] = '"' then begin
           result.ttype := ttString;
           inc(i);
           while (i<max) and (s[i] <> '"') do inc(i);
           inc(i);
           result.val := copy(s, start, i-start);
       end else if s[i] = '''' then begin
           result.ttype := ttString;
           inc(i);
           while (i<max) and (s[i] <> '''') do inc(i);
           inc(i);
           result.val := copy(s, start, i-start);
       end else begin
           assert(false, 'char isimportant but unrecognized?');
       end;
   end else if iswhite(s[i]) then begin
       result.ttype := ttWhitespace;
       while (i<max) and iswhite(s[i]) do inc(i);
       result.val := copy(s, start, i-start);
   end else begin
       result.ttype := ttString;
       while (i<max) and (not important(s[i])) and (not iswhite(s[i])) do inc(i);
       result.val := copy(s, start, i-start);
   end;
end;

Update (2009/02/22): Reddit discussion of this article.

::li,nv

2009-02-20 »