Context Free Grammars and Parsing


Recall the primary task of the parser:  construct a parse tree for an input string (program).  There are two ways to construct a parse tree:  explicitly, by actually building the tree as a data structure in memory, or implicitly, by carrying out the steps of the construction without ever storing the tree itself.

In our class, then, we will be using actual parse trees to describe the parsing process and indirect methods to describe how a parsing program can be designed.  

Starting with Standard EBNF

As usual, we start the task of designing a parser with the formal definition of a language, normally given in EBNF.  The parsing methods we will study, however, are described in terms of context free grammars, so the first step is to convert the EBNF definition into an equivalent set of context free grammar rules.

Here is an example of converting an EBNF rule into equivalent context free grammar rules. Suppose that the EBNF definition of the if statement in a programming language is:

IfStatement = "if" BooleanExpression "then" Statement [ "else" Statement ]

We can capture the intent of this single EBNF rule with these two context free grammar rules.

<IfStatement> ⟶ if <BooleanExpression> then <Statement>
<IfStatement> ⟶ if <BooleanExpression> then <Statement> else <Statement>

Here, we put angle brackets ("<" and ">") around nonterminals; terminals (tokens), such as if, then, and else, are written without brackets.

One could also use the following context free grammar rules, which are also equivalent to the above single EBNF rule:

<IfStatement> ⟶ if <BooleanExpression> then <Statement> <IfTail>
<IfTail>     ⟶ else <Statement>
<IfTail>     ⟶ ε
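The ε (empty string) rule is what allows the else part to be omitted.  For example, an if statement with no else part is derived by choosing the ε rule for <IfTail>:

  <IfStatement> ⟹ if <BooleanExpression> then <Statement> <IfTail>
                ⟹ if <BooleanExpression> then <Statement>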

In other words, the set of context free grammar rules corresponding to a given set of EBNF rules is not unique, but every EBNF grammar can be expressed as a context free grammar.

Similarly, the EBNF rule

StatementSequence  = Statement { ";" Statement }

can be expressed by the following three context free grammar rules:

<StatementSequence> ⟶ <Statement> <MoreStatements>
<MoreStatements> ⟶ ; <Statement> <MoreStatements>
<MoreStatements> ⟶ ε
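
To see that the recursion in <MoreStatements> plays the role of the EBNF repetition { ";" Statement }, consider a sequence of three statements (the <Statement> nonterminals are left unexpanded for clarity); the recursive rule is used once per additional statement:

  <StatementSequence> ⟹ <Statement> <MoreStatements>
                      ⟹ <Statement> ; <Statement> <MoreStatements>
                      ⟹ <Statement> ; <Statement> ; <Statement> <MoreStatements>
                      ⟹ <Statement> ; <Statement> ; <Statement>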

This means that the team designing the parser must often fiddle with the EBNF to turn it into a CFG, and then fiddle further with the CFG to ensure that it has the properties required by the chosen parsing scheme.  For example, we don't want ambiguity in our grammar, so we modify ambiguous grammars to make them unambiguous (if possible).  We do similar things with other undesirable grammar properties that we may encounter (to be discussed later).  There are a few hurdles, however.

Approaches to Constructing a Parse Tree

The changes we must make to a grammar to get it into a form ready for parsing a particular language depend on the parsing method chosen.  There are two major methods:

Top Down Parsing

Top-down.  In this case the parse proceeds from the root of the parse tree (the node labeled with the start symbol of the grammar) down toward the tokens.  At each step in the parse, the leaves of the current parse tree are scanned from left to right, and the leftmost nonterminal is the next to be expanded using a rule from the grammar.

Bottom Up Parsing

Bottom-up.  In this case the parse proceeds from the tokens back towards the root.  At each step in the parse, the unattached nodes (nodes that do not yet have a parent) are examined to see whether some consecutive group of them matches the right hand side of a rule in the grammar.  When the correct right hand side is found, a node for the left hand side is inserted into the tree, and its children (the right hand side elements of the rule) are attached to it.
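
To make the bottom-up direction concrete, here is how a two-statement sequence can be reduced using the <StatementSequence> rules from the previous section (the <Statement> subtrees are treated as already built); compare this with the top-down derivation shown earlier, which grows the same tree from the root instead of from the leaves:

  <Statement> ; <Statement>
  <Statement> ; <Statement> <MoreStatements>      (reduce ε to <MoreStatements>)
  <Statement> <MoreStatements>                    (reduce ; <Statement> <MoreStatements> to <MoreStatements>)
  <StatementSequence>                             (reduce <Statement> <MoreStatements> to <StatementSequence>)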

Practical Parsing Considerations

As with all programming tasks, however, we want to make the task of parsing as efficient as possible.  In particular, we want to be able to build parse trees under the following constraints:  the token stream is scanned once from left to right, the parser never backtracks (no parsing decision is ever undone), and at most one token of lookahead is used to decide what to do next.

Meeting the Objectives of Practical Parsing

Remember, there are many different grammars for a single programming language.   Not all of them will fit our objectives of parsing with no backtracking and one lookahead token.  This means that care must be taken in constructing a grammar that meets the objectives.  The three grammars below illustrate how the choice of grammar determines whether those objectives can be met.

Examples

Try some parses with the following three grammars.

Grammar 1:

1. <S> ⟶ <expression> eof
2. <expression> ⟶ <expression> - <expression>
3. <expression> ⟶ <expression> + <expression>
4. <expression> ⟶ <expression> * <expression>
5. <expression> ⟶ <expression> / <expression>
6. <expression> ⟶ <expression> ^ <expression>
7. <expression> ⟶ identifier
8. <expression> ⟶ integer_literal
9. <expression> ⟶ ( <expression> )

Try to parse a + b * c and a * b + c using this grammar.  Notice that the grammar is ambiguous.  There is more than one way to do each of these parses.  The result is that operator precedence is not properly handled here.
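
For instance, a + b * c (starting from <expression> and omitting the eof for brevity) has at least two distinct leftmost derivations under Grammar 1, one grouping the input as a + (b * c) and the other as (a + b) * c:

  <expression> ⟹ <expression> + <expression>                     (rule 3 first: grouping a + (b * c))
               ⟹ identifier + <expression>
               ⟹ identifier + <expression> * <expression>
               ⟹ identifier + identifier * <expression>
               ⟹ identifier + identifier * identifier

  <expression> ⟹ <expression> * <expression>                     (rule 4 first: grouping (a + b) * c)
               ⟹ <expression> + <expression> * <expression>
               ⟹ identifier + <expression> * <expression>
               ⟹ identifier + identifier * <expression>
               ⟹ identifier + identifier * identifier

Only the first grouping gives multiplication its usual higher precedence.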

Grammar 2

1.  <S> ⟶ <expression> eof
2.  <expression> ⟶ <expression> + <term>
3.  <expression> ⟶ <expression> - <term>
4.  <expression> ⟶ <term>
5.  <term>       ⟶ <term> * <factor>
6.  <term>       ⟶ <term> / <factor>
7.  <term>       ⟶ <factor>
8.  <factor>     ⟶ <factor> ^ <primary>
9.  <factor>     ⟶ <primary>
10. <primary>    ⟶ identifier
11. <primary>    ⟶ integer_literal
12. <primary>    ⟶ ( <expression> )

Try parsing a + b * c and a * b + c with this grammar as well.  The grammar is certainly uglier in terms of readability than the first grammar, but there is only one way to do each of these parses.  In fact, this grammar is unambiguous.  So, this grammar is better for programmed parsers, even though it is less clear on first glance for a human.  Grammar 2 also has the advantage that it captures operator precedence correctly.  It is also LR(1), which means that it can be parsed bottom up with no backtracking and one token of lookahead.
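
As an illustration, here is one way the bottom-up parse of a + b * c can proceed under Grammar 2 (a, b, and c are identifier tokens; each step reduces a right hand side to its left hand side, with the rule number in parentheses).  Notice that b * c is reduced to a single <term> before the + is reduced, which is exactly how the grammar enforces the precedence of * over +:

  a + b * c eof
  <primary> + b * c eof                    (10)
  <factor> + b * c eof                     (9)
  <term> + b * c eof                       (7)
  <expression> + b * c eof                 (4)
  <expression> + <primary> * c eof         (10)
  <expression> + <factor> * c eof          (9)
  <expression> + <term> * c eof            (7)
  <expression> + <term> * <primary> eof    (10)
  <expression> + <term> * <factor> eof     (9)
  <expression> + <term> eof                (5)
  <expression> eof                         (2)
  <S>                                      (1)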

Grammar 3


  1.  <S> ⟶ <expression> eof 
  2.  <expression>      ⟶  <term> <expression_tail>
  3.  <expression_tail> ⟶  + <term> <expression_tail>
  4.  <expression_tail> ⟶  - <term> <expression_tail>
  5.  <expression_tail> ⟶  ε
  6.  <term>            ⟶  <factor> <term_tail>
  7.  <term_tail>       ⟶  * <factor> <term_tail>
  8.  <term_tail>       ⟶  / <factor> <term_tail>
  9.  <term_tail>       ⟶  ε
  10. <factor>          ⟶  <primary> <factor_tail>
  11. <factor_tail>     ⟶  ^ <primary> <factor_tail>
  12. <factor_tail>     ⟶  ε
  13. <primary>         ⟶  identifier
  14. <primary>         ⟶  integer_literal
  15. <primary>         ⟶  ( <expression> )
  

Try parsing a + b * c and a * b + c with this grammar as well.  The grammar is certainly even uglier in terms of readability than the first two grammars, but there is only one way to do each of these parses.  In fact, this grammar is also unambiguous.  So, this grammar is better for programmed parsers, even though it is less clear on first glance for a human.  Grammar 3 also has the advantage that it captures operator precedence correctly.  It is also LL(1), which means that it can be parsed top down with no backtracking and one token of lookahead.
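
To show how an LL(1) grammar like Grammar 3 translates directly into a top-down parser, here is a small recursive descent recognizer written in Python (a sketch only; the Parser class, its method names, and the token spellings are invented for this illustration, not taken from any particular compiler).  There is one procedure per nonterminal, and each procedure decides which rule to apply by inspecting only the single lookahead token.

# A minimal recursive descent recognizer for Grammar 3.  Tokens are assumed
# to come from a scanner; each token here is just a string such as
# "identifier", "integer_literal", "+", "(", ")", or "eof".

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0                      # index of the single lookahead token

    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, expected):
        # Consume the lookahead token if it is the one we expect.
        if self.lookahead() != expected:
            raise ParseError(f"expected {expected}, saw {self.lookahead()}")
        self.pos += 1

    def parse_S(self):                    # rule 1
        self.parse_expression()
        self.match("eof")

    def parse_expression(self):           # rule 2
        self.parse_term()
        self.parse_expression_tail()

    def parse_expression_tail(self):      # rules 3, 4, 5
        if self.lookahead() in ("+", "-"):
            self.match(self.lookahead())
            self.parse_term()
            self.parse_expression_tail()
        # otherwise apply the epsilon rule: consume nothing

    def parse_term(self):                 # rule 6
        self.parse_factor()
        self.parse_term_tail()

    def parse_term_tail(self):            # rules 7, 8, 9
        if self.lookahead() in ("*", "/"):
            self.match(self.lookahead())
            self.parse_factor()
            self.parse_term_tail()

    def parse_factor(self):               # rule 10
        self.parse_primary()
        self.parse_factor_tail()

    def parse_factor_tail(self):          # rules 11, 12
        if self.lookahead() == "^":
            self.match("^")
            self.parse_primary()
            self.parse_factor_tail()

    def parse_primary(self):              # rules 13, 14, 15
        if self.lookahead() in ("identifier", "integer_literal"):
            self.match(self.lookahead())
        elif self.lookahead() == "(":
            self.match("(")
            self.parse_expression()
            self.match(")")
        else:
            raise ParseError(f"unexpected token {self.lookahead()}")

# a + b * c arrives from the scanner as: identifier + identifier * identifier eof
Parser(["identifier", "+", "identifier", "*", "identifier", "eof"]).parse_S()

Each ε rule (rules 5, 9, and 12) shows up as a procedure simply returning when its lookahead token is not one it handles.  Because the single lookahead token always determines which rule to apply, the parser never backtracks, which is exactly the LL(1) property claimed above.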