2 Posts Tagged 'Parsing'
Review: Parsing Techniques: A Practical Guide
For Christmas this year, I received a shiny hardback copy of Parsing Techniques: A Practical Guide by Grune and Jacobs. It's a thrilling book, if you want to learn parsing, which I do.
Where most books proceed in a sort of linear fashion, this book teaches parsing in layers. First you learn what a grammar is. Then you learn what it means to parse: what's a parse tree? What's bottom-up vs. top-down? What's a leftmost vs. rightmost derivation?
Next you get some general ideas and methods for parsing, e.g. CYK and Unger, and then you dive into the implementations of parsers (in pseudocode and in C) in great detail. This is about as far as I've gotten so far, before having to go back and figure out what the heck I just read. But it's an interesting progression. Reading the book, I feel like I'm constantly revisiting things I learned a few chapters ago, but this time in more detail. The book kind of does a breadth-first traversal of the world of parsing.
Be warned however: this book is not easy reading. It's dense, heavy on the info, light on the entertainment. Unless you really get a kick of out parsing, this will probably put you to sleep if taken in large doses. But it is a trove of information, and I couldn't put the book down during certain chapters.
In fact there's so much information in this book that it's almost depressing. The bibliography alone takes up 1/4 of the book, and lists 1,500(!) authors. It'd take me a week to read the bibliography, and probably many years to read every book listed there. Parsing could easy consume a lifetime of study, and I'm saddened that I'm probably never going to find the time to master all there is to know. But such is life.
If I had one quibble with this book, it'd be the same quibble I have with most math papers. The notation is horrible. Say what you will about programmers, most of us know that code is written for humans, not for machines, and we give our variables descriptive names. In math it's all single letters variable names.
When the authors of this book run out of single letters, they use letters with bars over them, or bold letters vs. normal typeface letters, or they do things like this:
...whenever a non-terminal A is entered in to entry Ri,l of the recognition table because there is a rule A -> BC and B is in Ri,k, and C is in Ri+k,l-k, the rule A_i_l -> B_i_k C_m_n is added to the parse forest grammar, where m = i + k and n = i + l - k.
This is the first paragraph of a section. Those variables are not mentioned before this sentence. This is certainly not a style of writing that I'm used to reading. It takes me a good dozen tries to understand. (Using lowercase i's and l's right next to each other should be prohibited by law.)
In any case, this book is good. One of my favorite tools has always been Perl-style regular expressions, and I feel like this book has expanded my understanding of how they work. Learning to write a recognizer, learning how things are implemented under the hood, you couldn't ask for a more interesting topic. I can't wait to try writing a toy parser generator or regex recognizer in Clojure once I've solidified my understanding of some of these concepts.
Let's parse
Is there anything more fun than parsing strings? I submit to you that there is not. I'm currently reading my way through Parsing Techniques - A Practical Guide, which has a first edition free online. (I'm hoping Santa brings me a copy of the 2nd edition this year.)
This is a good book, with enough math to be rigorous but not so much that it's completely unreadable. It starts from the absolute basics ("What's a grammar?") and goes through the Chomsky hierarchy and then dives into parsing techniques in great detail, in a language-agnostic way.
Languages and grammars are fascinating. In high school I studied Spanish, French, Latin and German, largely in my spare time. When I was 16, if people asked what I wanted to do for a living, I said "translator".
The plan to become a translator failed partly because the quality of my early education was horrendous and partly because mastering a language is extremely difficult and at 16 I wasn't motived enough. And then computers showed up in my life, which gave me a never-ending supply of languages to play with, while being fun (and profitable) in so many other ways. But I still took two years of Japanese classes in college for no reason other than enjoyment, and I'm still trying (and failing) to learn Japanese in my spare time 8 years later.
Perl was my first favorite language probably for no reason other than regular expressions. I can understand how people call PCRE syntax line-noise, but to me it's beautiful line noise. I live and breathe regular expressions nowadays. My favorite CS class in college was one where we went through and laboriously built finite-state automata and pushdown automata and Turing machines. Seeing the equivalence of these simple machines with the different classes of grammars was a huge epiphany. Such a simple concept with such huge consequences.
Dijkstra said:
Besides a mathematical inclination, an exceptionally good mastery of one's native tongue is the most vital asset of a competent programmer.
I strongly agree with that sentiment. People tell me at times that I'm good at written communication. I have my doubts, and anyways I find it funny because I'm so terrible at verbal communication. I think if I have any success at writing, it's because I view writing as a mechanical process.
I told a prof in college once that I felt like my papers wrote themselves once I had an idea in mind. There are rules of grammar and style, and you learn them and follow them, or break them deliberately if you have a good reason to. You write some prose, then you debug it until it "works" mentally. I don't care about typos and I split infinitives and comma-splice on purpose, but ambiguous or awkward phrases usually stand out to me like compiler bugs in my brain.
What's more important than language? Few things. Language is important enough to be nearly hard-wired into our brains. Children learn it instinctively. Human beings can still easily and effortlessly out-perform the best supercomputer at the task of parsing and interpreting speech. We think in words. The programming languages computers understand are dirt-simple by comparison, but writing code still feels like writing "thoughts for the computer" sometimes.
There are very few times you'll hear me say "What a wonderful world we live in". But one of those times is when I have the opportunity to explore an area of study like language. It's such an enjoyable experience to struggle and try to master such a thing. It's an amazing universe where we have these weird little rules and they work and we can understand them and manipulate them and produce things with them.
