For Christmas this year, I received a shiny hardback copy of Parsing Techniques: A Practical Guide by Grune and Jacobs. It's a thrilling book, if you want to learn parsing, which I do.
Where most books proceed in a sort of linear fashion, this book teaches parsing in layers. First you learn what a grammar is. Then you learn what it means to parse: what's a parse tree? What's bottom-up vs. top-down? What's a leftmost vs. rightmost derivation?
Next you get some general ideas and methods for parsing, e.g. CYK and Unger, and then you dive into the implementations of parsers (in pseudocode and in C) in great detail. This is about as far as I've gotten before having to go back and figure out what the heck I just read. But it's an interesting progression. Reading the book, I feel like I'm constantly revisiting things I learned a few chapters ago, but this time in more detail. The book kind of does a breadth-first traversal of the world of parsing.
Be warned, however: this book is not easy reading. It's dense, heavy on the info, light on the entertainment. Unless you really get a kick out of parsing, this will probably put you to sleep if taken in large doses. But it is a trove of information, and I couldn't put the book down during certain chapters.
In fact, there's so much information in this book that it's almost depressing. The bibliography alone takes up 1/4 of the book, and lists 1,500(!) authors. It'd take me a week to read the bibliography, and probably many years to read every book listed there. Parsing could easily consume a lifetime of study, and I'm saddened that I'm probably never going to find the time to master all there is to know. But such is life.
If I had one quibble with this book, it'd be the same quibble I have with most math papers. The notation is horrible. Say what you will about programmers, but most of us know that code is written for humans, not for machines, and we give our variables descriptive names. In math it's all single-letter variable names.
When the authors of this book run out of single letters, they use letters with bars over them, or bold letters vs. normal typeface letters, or they do things like this:
...whenever a non-terminal A is entered into entry Ri,l of the recognition table because there is a rule A -> BC and B is in Ri,k, and C is in Ri+k,l-k, the rule A_i_l -> B_i_k C_m_n is added to the parse forest grammar, where m = i + k and n = i + l - k.
This is the first paragraph of a section. Those variables are not mentioned before this sentence. This is certainly not a style of writing that I'm used to reading. It takes me a good dozen tries to understand. (Using lowercase i's and l's right next to each other should be prohibited by law.)
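For comparison, here is roughly what that sentence seems to describe, written as a small Python sketch with descriptive names. This is only one reading of the quoted passage, not the book's code: the function name, the rule formats, and the 1-based positions are all assumptions made for illustration, and the grammar is assumed to contain only rules of the form A -> B C or A -> terminal. Here `table[(start, length)]` stands in for the recognition table entry Ri,l.

```python
from collections import defaultdict


def cyk_parse_forest(tokens, unit_rules, binary_rules):
    """CYK recognizer that also records parse-forest rules (illustrative sketch).

    unit_rules:   (nonterminal, terminal) pairs, e.g. ("S", "a")
    binary_rules: (parent, left, right) triples, i.e. rules A -> B C
    Positions are 1-based to stay close to the quoted notation.
    """
    token_count = len(tokens)
    # table[(start, length)] is the set of non-terminals that derive the
    # substring beginning at `start` and spanning `length` tokens.
    table = defaultdict(set)
    forest_rules = set()

    # Substrings of length 1 come straight from the terminal rules.
    for start, token in enumerate(tokens, start=1):
        for nonterminal, terminal in unit_rules:
            if terminal == token:
                table[(start, 1)].add(nonterminal)

    # Longer substrings are built by trying every split point.
    for length in range(2, token_count + 1):
        for start in range(1, token_count - length + 2):
            for left_length in range(1, length):
                right_start = start + left_length
                right_length = length - left_length
                for parent, left, right in binary_rules:
                    if (left in table[(start, left_length)]
                            and right in table[(right_start, right_length)]):
                        table[(start, length)].add(parent)
                        # Each symbol is tagged with the span it covers,
                        # which is what the subscripts in the quote encode.
                        forest_rules.add((
                            (parent, start, length),
                            (left, start, left_length),
                            (right, right_start, right_length),
                        ))
    return table, forest_rules


if __name__ == "__main__":
    # Hypothetical ambiguous grammar: S -> S S | "a"
    table, forest_rules = cyk_parse_forest(
        list("aaa"), [("S", "a")], [("S", "S", "S")]
    )
    print("recognized:", "S" in table[(1, 3)])
    for rule in sorted(forest_rules):
        print(rule)
```

Each recorded triple is one rule of the parse forest grammar: a parent tagged with its span, followed by its two children tagged with theirs. With an ambiguous grammar like the one in the usage example, the same parent span shows up with more than one pair of children, one per way of splitting the substring.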
In any case, this book is good. One of my favorite tools has always been Perl-style regular expressions, and I feel like this book has expanded my understanding of how they work. Learning to write a recognizer and learning how things are implemented under the hood: you couldn't ask for a more interesting topic. I can't wait to try writing a toy parser generator or regex recognizer in Clojure once I've solidified my understanding of some of these concepts.