Context Free Grammars and Parsing
Recall the primary task of the parser: construct a parse tree for an input string (program). There are two ways to construct a parse tree:
- Directly. In this case, an actual parse tree is constructed as a data structure. This is the method normally used in class for describing how a parse proceeds. An actual tree is also maintained by more complex compilers (usually those requiring multiple passes).
- Indirectly. In this case we don't actually construct the tree as a data structure with nodes and links, but rather we do the parsing using other methods, such as the use of a stack (recall that a pushdown automaton exists for every context free language that can parse---accept or reject---a string submitted to it). This method is used most often in one-pass compilers, which don't need to keep the entire parse tree around for later passes on the way to producing target machine code.
In our class, then, we will be using actual parse trees to describe the parsing process and indirect methods to describe how a parsing program can be designed.
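As a tiny illustration of the indirect, stack-based approach, here is a pushdown-automaton-style recognizer for balanced parentheses, a simple context free language (the example language is ours, chosen for brevity, not taken from the course notes). It accepts or rejects a string using only a stack, never building a tree.

```python
# Indirect (stack-based) recognition: a pushdown-automaton-style
# recognizer for balanced parentheses. No tree is built; the stack
# alone decides accept or reject.

def accepts_balanced(s: str) -> bool:
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)      # push on an opening symbol
        elif ch == ")":
            if not stack:
                return False      # nothing to match: reject
            stack.pop()           # match and pop
        else:
            return False          # symbol outside the alphabet
    return not stack              # accept only if the stack is empty

print(accepts_balanced("(()())"))  # True
print(accepts_balanced("(()"))     # False
```

Real one-pass compilers use the same idea on a much larger scale: the stack records where the parse is, and no explicit tree ever exists.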
Starting with Standard EBNF
As usual, we start the task of designing a parser with the formal definition of a language, normally in EBNF. Note the following things:
- EBNF is generally for human consumption. It is easier to read than a standard cfg.
- EBNF is often not in the desired form for parsing when a parser is being built by hand. (EBNF can in fact be used directly for hand-built parsers, but in this course it will help us understand the parser much better and more efficiently to look at a context free grammar equivalent to the EBNF for the programming language we are compiling.)
- EBNF must therefore often be converted to an equivalent context free grammar form in order to use it for a hand written compiler, although, as noted in the previous bullet, there are techniques used in recursive descent compilers that closely follow the EBNF rules.
- Once the EBNF is converted to an equivalent context free grammar, the grammar must usually be modified for use with a particular parsing technique.
- EBNF is appropriate as input to some parser tools (notably the one called yacc). Even then, however, the EBNF must often be manipulated to make it usable for a particular parsing strategy.
Here is an example of converting an EBNF rule into equivalent context free grammar rules. Suppose that the EBNF definition of the if statement in a programming language is:
IfStatement = "if" BooleanExpression "then" Statement [ "else" Statement ]
We can capture the intent of this single EBNF rule with these two context free grammar rules.
Here, we put angle brackets ("<" and ">") around nonterminals and make terminals (tokens) bold.
<IfStatement> ⟶ if <BooleanExpression> then <Statement>
<IfStatement> ⟶ if <BooleanExpression> then <Statement> else <Statement>
One could also use the following context free grammar rules, which are also equivalent to the above single EBNF rule:
<IfStatement> ⟶ if <BooleanExpression> then <Statement> <IfTail>
<IfTail> ⟶ else <Statement>
<IfTail> ⟶ ε
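The <IfTail> formulation translates directly into a recursive descent procedure: the single lookahead token decides between the two rules, choosing "else <Statement>" when the next token is else and the ε rule otherwise. A sketch, using a hypothetical scanner interface (the peek/expect helpers and the stand-in BE and S tokens are invented for illustration):

```python
# Hypothetical token stream and scanner helpers; in a real compiler
# these would come from the scanner. "BE" and "S" stand in for a
# Boolean expression and a statement (names invented for illustration).
tokens = []
pos = 0

def peek():
    return tokens[pos] if pos < len(tokens) else "eof"

def expect(tok):
    global pos
    assert peek() == tok, f"expected {tok}, saw {peek()}"
    pos += 1

def parse_boolean_expression():
    expect("BE")            # stand-in for a real expression parser

def parse_statement():
    expect("S")             # stand-in for a real statement parser

def parse_if_statement():
    # <IfStatement> -> if <BooleanExpression> then <Statement> <IfTail>
    expect("if")
    parse_boolean_expression()
    expect("then")
    parse_statement()
    parse_if_tail()

def parse_if_tail():
    # One token of lookahead chooses between the two <IfTail> rules.
    if peek() == "else":
        expect("else")      # <IfTail> -> else <Statement>
        parse_statement()
    # otherwise: <IfTail> -> epsilon, consume nothing

tokens[:] = ["if", "BE", "then", "S", "else", "S"]
parse_if_statement()
print(pos)  # all 6 tokens consumed
```

Note that the ε rule costs nothing: parse_if_tail simply returns without consuming a token, which is exactly what the grammar says.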
In other words, there is no unique set of context free grammar rules that correspond to a set of EBNF rules, but every EBNF grammar can be expressed as a context free grammar.
Similarly, the EBNF rule
StatementSequence = Statement { ";" Statement }
can be expressed by the following two context free grammar rules:
<StatementSequence> ⟶ <Statement> <MoreStatements>
<MoreStatements> ⟶ ; <Statement> <MoreStatements>
<MoreStatements> ⟶ ε
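In a recursive descent parser, the right-recursive <MoreStatements> rules become a simple loop: keep consuming "; <Statement>" while the lookahead token is a semicolon, and fall out of the loop for the ε rule. A sketch, assuming a hypothetical token list in which "S" stands in for a full statement:

```python
# Sketch of parsing <StatementSequence> -> <Statement> <MoreStatements>.
# The right recursion in <MoreStatements> is written as a loop; "S"
# stands in for a full statement (names invented for illustration).
def parse_statement_sequence(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "eof"

    def expect(tok):
        nonlocal pos
        assert peek() == tok, f"expected {tok}, saw {peek()}"
        pos += 1

    expect("S")              # <Statement>
    while peek() == ";":     # <MoreStatements> -> ; <Statement> <MoreStatements>
        expect(";")
        expect("S")
    # leaving the loop is the epsilon rule for <MoreStatements>
    return pos               # number of tokens consumed

print(parse_statement_sequence(["S", ";", "S", ";", "S"]))  # 5
```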
This means that the team designing the parser must often fiddle with the EBNF to turn it into a CFG, and then further fiddle with the CFG to ensure that it has the properties necessary for the chosen parsing scheme. For example, we don't want ambiguity in our grammar, so we modify ambiguous grammars to make them unambiguous (if possible). We do similar things with other undesirable grammar properties that we may encounter (to be discussed later). There are a few hurdles, however.
- Some context free languages are inherently ambiguous. This just means that there are no unambiguous grammars for these languages.
- If some small parts of a programming language are inherently ambiguous, as is usual when practice meets theory, we just find a workaround for those small pieces.
- One aspect of programming languages that is not possible to capture in any context free grammar at all is the requirement that variables be used only in the context of their declarations. Since declarations are made arbitrarily far earlier than the use of a variable, such information is context sensitive and cannot be captured in a context free grammar. The symbol table is the workaround for this problem.
Approaches to Constructing a Parse Tree
The changes we must make to a grammar to get it in a form ready for parsing a particular language are actually dependent on the parsing method chosen. There are two major methods.
Top Down Parsing
Top-down. In this case the parse proceeds from the root (the start symbol of the grammar). At each step in the parse, the leaves of the current parse tree are scanned from left to right, and the leftmost nonterminal is the next to be expanded using a rule from the grammar.
Bottom Up Parsing
Bottom-up. In this case the parse proceeds from the tokens back towards the root. At each step in the parse, all unattached nodes (nodes with no parent) are reviewed to see whether a right hand side of some rule in the grammar can be found. When the correct right hand side is found, the left hand side is inserted into the tree, and its children (the right hand side elements of the rule) are attached to it.
Practical Parsing Considerations
As with all programming tasks, however, we want to make the task of parsing as efficient as possible. In particular, we want to be able to build parse trees under the following constraints:
- No backtracking! We do not want to build a parser that tries to apply certain rules in constructing the parse tree, only to discover later that different rules should have been applied, causing the parser to backtrack in order to deconstruct the tree built so far and reconstruct it with different rules. For example, there may be three rules with <expression> on the left. We want to be able to choose the correct rule to use in a particular parse without being required to try all three to see which one (if any) works.
- One lookahead token! We want to be able to proceed as far as possible in a non-backtracking parse knowing just the next token seen by the scanner. The parser should not be required to examine many tokens at one time in order to determine which rule to apply in the parse tree.
Meeting the Objectives of Practical Parsing
Remember, there are many different grammars for a single programming language. Not all of them will fit our objectives of parsing with no backtracking and one lookahead token. This means that care must be taken in constructing a grammar that meets the objectives. The process proceeds as follows:
- Start with the formal EBNF definition of the programming language.
- Turn the EBNF into an equivalent context free grammar.
- Ensure that the rules of the grammar fit one of two different objectives:
- Top-down parsing with no backtracking and one token of lookahead (a grammar that has these characteristics is called an LL(1) grammar).
- Bottom-up parsing with no backtracking and one token of lookahead (a grammar that has these characteristics is called an LR(1) grammar).
Examples
Try some parses with the following three grammars.
Grammar 1:
1. <S> ⟶ <expression> eof
2. <expression> ⟶ <expression> - <expression>
3. <expression> ⟶ <expression> + <expression>
4. <expression> ⟶ <expression> * <expression>
5. <expression> ⟶ <expression> / <expression>
6. <expression> ⟶ <expression> ^ <expression>
7. <expression> ⟶ identifier
8. <expression> ⟶ integer_literal
9. <expression> ⟶ ( <expression> )
Try to parse a + b * c and a * b + c using this grammar. Notice that the grammar is ambiguous. There is more than one way to do each of these parses. The result is that operator precedence is not properly handled here.
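To see the ambiguity concretely, here are the two parse trees for a + b * c written as nested (operator, left, right) tuples, with a small evaluator showing that the two parses compute different results (the tuple encoding and the evaluator are just for illustration):

```python
# Two distinct parse trees for "a + b * c" under Grammar 1, written as
# nested (operator, left, right) tuples. Both are legitimate parses,
# which is what it means for the grammar to be ambiguous.

tree1 = ("+", "a", ("*", "b", "c"))   # a + (b * c): rule 3 applied first
tree2 = ("*", ("+", "a", "b"), "c")   # (a + b) * c: rule 4 applied first

def evaluate(t, env):
    # Evaluate a tree numerically to show the two parses disagree.
    if isinstance(t, str):
        return env[t]
    op, left, right = t
    l, r = evaluate(left, env), evaluate(right, env)
    return l + r if op == "+" else l * r

env = {"a": 2, "b": 3, "c": 4}
print(evaluate(tree1, env))  # 14 -- the precedence we want
print(evaluate(tree2, env))  # 20 -- the other, unwanted parse
```

Since the two trees yield different answers, a compiler built on Grammar 1 could legitimately generate code for either one; the grammar itself gives no way to prefer the correct parse.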
Grammar 2:
1. <S> ⟶ <expression> eof
2. <expression> ⟶ <expression> + <term>
3. <expression> ⟶ <expression> - <term>
4. <expression> ⟶ <term>
5. <term> ⟶ <term> * <factor>
6. <term> ⟶ <term> / <factor>
7. <term> ⟶ <factor>
8. <factor> ⟶ <factor> ^ <primary>
9. <factor> ⟶ <primary>
10. <primary> ⟶ identifier
11. <primary> ⟶ integer_literal
12. <primary> ⟶ ( <expression> )
Try parsing a + b * c and a * b + c with this grammar as well. The grammar is certainly uglier in terms of readability than the first grammar, but there is only one way to do each of these parses. In fact, this grammar is unambiguous. So, this grammar is better for programmed parsers, even though it is less clear on first glance for a human. Grammar 2 also has the advantage that it captures operator precedence correctly. It is also LR(1), which means that it can be parsed bottom up with no backtracking and one token of lookahead.
Grammar 3:
1. <S> ⟶ <expression> eof
2. <expression> ⟶ <term> <expression_tail>
3. <expression_tail> ⟶ + <term> <expression_tail>
4. <expression_tail> ⟶ - <term> <expression_tail>
5. <expression_tail> ⟶ ε
6. <term> ⟶ <factor> <term_tail>
7. <term_tail> ⟶ * <factor> <term_tail>
8. <term_tail> ⟶ / <factor> <term_tail>
9. <term_tail> ⟶ ε
10. <factor> ⟶ <primary> <factor_tail>
11. <factor_tail> ⟶ ^ <primary> <factor_tail>
12. <factor_tail> ⟶ ε
13. <primary> ⟶ identifier
14. <primary> ⟶ integer_literal
15. <primary> ⟶ ( <expression> )
Try parsing a + b * c and a * b + c with this grammar as well. The grammar is certainly even uglier in terms of readability than the first two grammars, but there is only one way to do each of these parses. In fact, this grammar is also unambiguous. So, this grammar is better for programmed parsers, even though it is less clear on first glance for a human. Grammar 3 also has the advantage that it captures operator precedence correctly. It is also LL(1), which means that it can be parsed top down with no backtracking and one token of lookahead.