Parsing: introduction

A parser is a tool that transforms a stream into a structure.

The input stream is typically a file or string containing characters. The output structure is ususally a syntax tree, but it can be many other things:

This series of articles will take you through many different parsing technologies, ranging over Python, C#, Tcl, C++, Lex & yacc, Antlr, OMeta, and Schol, among others. Even if you're not planning to write a parser, you may learn some general-purpose tricks, such as reflection in C# or the syntax magic of Tcl.

Data formats are typically easier to parse than full programming languages. We will cover aspects of both. To parse a program written in a higher-level language such as C or OCaml, you often need more than just a parser: you need symbol tables, semantic information, and a type system. These are not on our radar here; we focus exclusively on parsing.

Parsing is easier than you think

Parsers are useful for many different tasks, from reading a simple configuration file to compiling a full programming language. In fact, they are so useful that they should be part of your programming toolbox, right next to linked lists and sorting algorithms.

Many programmers think that parsing is difficult. I guess parsing is too often explained in terms of abstract theories, which makes it seem complicated. Experts enjoy juggling difficult concepts and jargon such as "context-free left-factored grammars" and "non-deterministic pushdown automata". It sounds scary, but the plain truth is that you don't need any of this stuff to write a good parser. Most of the techniques we will cover are extremely simple and yet very powerful at the same time.

When explaining the parsing process, many articles and papers start from the theory and then move on to practical examples. We will of course do exactly the opposite: We start from very concrete hand-written parsers for simple (sometimes trivial) languages. Then we introduce the abstraction layers, moving towards grammars and other more theoretical concepts in small steps.

Series overview