Efficient, Deterministic Parsing

Objective

The objective of parsing is to determine whether a given token input file represents a valid program in the programming language being compiled.

How would you do this if you were not given any directions?  Think about this for a while, and you will see how much we owe to the theoretical underpinnings of formal languages and compiling done by others.

Means

A way to meet this objective is to construct a parse tree for the input program using the rules of the grammar for the programming language.  There are two primary ways to build such a tree: top down, working from the start symbol toward the tokens at the leaves, and bottom up, working from the tokens back toward the start symbol.

Grammar Requirements for Practical Parsing

Practical Top Down Parsing

Given the above requirements, we can express what property we need a programming language grammar to have if we intend to parse programs in a top down fashion.  It is this: at every step of the parse, the leftmost nonterminal and a single token of lookahead must determine exactly one rule to apply, with no backtracking.

The Big Question

The big question is this:  Are we asking too much?  Parsing would be pleasant if we could build a top-down parser with all of the above properties that could build a parse tree for any input program (or recognize at some point during the parse that no parse tree exists for that purported program).  Are there such grammars for real programming languages, or are the restrictions--no ambiguity, no backtracking, and one lookahead token that always predicts which rule, if any, applies to the leftmost nonterminal in a top-down parse--just so restrictive that no real programming language could have such a grammar?

The answers to these questions are, "No, we are not asking too much," and, "Yes, there are context free grammars for most programming languages that have these kinds of properties (with a few minor exceptions)."   How do we know?  Well, once again we have to doff our virtual hats to the theoreticians who explored and developed the theory of grammars with these properties.  We will lean heavily on the theory they developed.

Let's look at an example.

An Intuitive View

Let's see how top-down, deterministic parsing with one token of lookahead might go, given the proper grammar.  We will use grammar 3 from last time.


   1 <expression>      ⟶  <term> <expression_tail>
   2 <expression_tail> ⟶  + <term> <expression_tail>
   3 <expression_tail> ⟶  - <term> <expression_tail>
   4 <expression_tail> ⟶  ε
   5 <term>            ⟶  <factor> <term_tail>
   6 <term_tail>       ⟶  * <factor> <term_tail>
   7 <term_tail>       ⟶  / <factor> <term_tail>
   8 <term_tail>       ⟶  ε
   9 <factor>          ⟶  <primary><factor_tail>
  10 <factor_tail>     ⟶  ^ <primary> <factor_tail>
  11 <factor_tail>     ⟶  ε
  12 <primary>         ⟶  identifier
  13 <primary>         ⟶  integer_literal
  14 <primary>         ⟶  ( <expression> )  

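The rule listing above can be encoded directly as data.  Here is a minimal Python sketch, using shortened nonterminal names (<e> for <expression>, and so on) and assuming token spellings like "id" and "intlit"; those spellings are illustrative, not fixed by the grammar.

```python
# Grammar 3 as a rule table: nonterminal -> list of right-hand sides.
# Keys in angle brackets are nonterminals; everything else is a token.
# An empty right-hand side [] stands for an epsilon rule.
GRAMMAR = {
    "<e>":   [["<t>", "<e_t>"]],                                   # rule 1
    "<e_t>": [["+", "<t>", "<e_t>"], ["-", "<t>", "<e_t>"], []],   # rules 2-4
    "<t>":   [["<f>", "<t_t>"]],                                   # rule 5
    "<t_t>": [["*", "<f>", "<t_t>"], ["/", "<f>", "<t_t>"], []],   # rules 6-8
    "<f>":   [["<p>", "<f_t>"]],                                   # rule 9
    "<f_t>": [["^", "<p>", "<f_t>"], []],                          # rules 10-11
    "<p>":   [["id"], ["intlit"], ["(", "<e>", ")"]],              # rules 12-14
}
```

A table like this is convenient later, when we want to analyze the grammar (for example, to compute which lookahead tokens select which rule) rather than just read it.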
Suppose that two different people are parsing a string using this grammar given only one token of lookahead at any point.  Both start with the start symbol, <expression> (for convenience, we will shorten the names of all of the nonterminals to just their first letters, so <expression> becomes <e>, <term> becomes <t>, and so on).

Suppose that both people are given the lookahead token id as the first token of the string they are to parse.  The first rule they both must apply is rule 1, <expression> ⟶ <term> <expression_tail>.

Since id is the first token, it will naturally have to appear as the first leaf of the completed parse tree if the parse is to be successful.  You might think that both people used the same rule (1) because it is the only rule with <expression> on the left.  It turns out that we can actually determine at this point whether it is possible to even use this one rule.  An analysis of the grammar shows that applying this rule can indeed eventually lead to an id as the first token.

The lookahead token has not changed.  It is still id.  We only get a new lookahead token when the one we have now is attached to the parse tree (and therefore we are no longer looking ahead to it).  So both people expand their trees again, and the leftmost nonterminal, <t>, is expanded as usual.

Here, rule 5 is used, and it really is the only choice both have, since it is the only rule with <t> on the left.

The next rule must be applied to <f>.  Again both use rule 9.

Finally, both people are now about to expand <p>, the leftmost nonterminal.  Both have the same lookahead token (id).  Both see three different rules that can be applied, namely 12, 13, and 14; those are all the rules with <p> on the left-hand side.  But given that the first token in the file is id, it is clear that only rule 12 applies.  If rule 13 were used, we would be hanging an integer literal leaf on the tree, which doesn't match the first token of the input file (the id).  Similarly, if rule 14 were used, a left parenthesis, (, would be placed on the tree in this position, which also would not match the id in the token file.
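The choice just described is completely mechanical: the lookahead token alone selects the rule.  A small sketch in Python, assuming token spellings "id" and "intlit" (illustrative names, not fixed by the grammar):

```python
def predict_primary(lookahead):
    """Pick the unique <primary> rule forced by one token of lookahead."""
    if lookahead == "id":
        return 12          # <primary> -> identifier
    if lookahead == "intlit":
        return 13          # <primary> -> integer_literal
    if lookahead == "(":
        return 14          # <primary> -> ( <expression> )
    return None            # no rule applies: a syntax error at this point
```

Any two people running this function on the same lookahead token must, of course, get the same rule, which is exactly the behavior observed above.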

The important point to notice is that even though two different people are parsing two possibly different input token files they have been forced to use the same rules at each step so far based on the fact that the lookahead was the same for both throughout.  There were no other choices.

At this point we have hung the id token on the tree where it belongs.  To continue parsing, we need to get the next token from the token file as our lookahead token.  Suppose it is the multiplication operator *.  With this lookahead, both people need to decide how to expand the leftmost unexpanded nonterminal, which in this case is <f_t>.  Notice that there are two rules with <f_t> on the left:

    
  10 <factor_tail>     ⟶  ^ <primary> <factor_tail>
  11 <factor_tail>     ⟶  ε

 

Both people will notice that rule 10 won't work, because it will hang the token ^ on the tree when the lookahead token implies that the next token to hang on the tree is the *.  So, both people are left with rule 11 (we will see later that we can determine right now with this grammar whether using rule 11 will ever lead to a * being attached to the tree next).

Once again, both people who are doing this parse without knowing what the other is doing are led to use the same rule.  There is no backtracking, because in every instance, only one rule has worked.

So, now the nonterminal to expand is <t_t>.  The lookahead token is still * because it has not yet been hung on the tree.  Look at the three rules with <t_t> on the left:

    
   6 <term_tail>       ⟶  * <factor> <term_tail>
   7 <term_tail>       ⟶  / <factor> <term_tail>
   8 <term_tail>       ⟶  ε

Using rule 6 will mean that the lookahead token, *, will be attached to the tree at this point, which is what we desire.  Application of rule 7 would cause a problem, because a / would be hung on the tree where a * was called for.  We can't easily see what would happen if we applied rule 8 (it could be that a * would be arrived at by some expansions that follow).  Without going into details at this point, an analysis of the grammar shows that if we were to use rule 8, there is no way that a * could be hung as the next token on the tree.

So, here we are again.  Both people must apply rule 6 to continue the parse.

 

Now notice something different.  Suppose that at this point person 1 is given a lookahead token of id, whereas person 2 is given a lookahead token of intlit.  Both would apply rule 9 to <f>:

9. <factor> ⟶ <primary> <factor_tail>

At this point, though, person 1 would apply rule 12 to <primary>

12. <primary> ⟶ identifier

because he has a lookahead token of id.  Person 2 would apply rule 13 to <primary> in her tree,

13. <primary> ⟶ integer_literal

because her lookahead token is integer_literal.

The important thing to notice is that if two parses are being done independently, and both have a given nonterminal to expand, and both have the same lookahead token, in both parses the same rule has to be applied.  There are no other choices.

The Abstraction

Given the above considerations for practical parsing in general, what would we want in a practical grammar for top down parsing?  Basically, what we need is a grammar that satisfies the following:

We can envision this by looking at a general parse tree at some point in a top down, leftmost parse.  So far, as tokens have come in, the tree has been expanded to the point that the tokens w have already been attached to the tree, and now A is the leftmost nonterminal that is to be expanded.  The Greek letter α just stands for the rest of the leaves to the right of A, some of them tokens and others nonterminals.

Given a tree like this and given the next token beyond w, the question is which rule we should use to expand A.  If we can determine which rule to use at every point, then we say that the grammar is an LL grammar.

If we have such an LL grammar, then we can parse the string in O(n) time, where n is the number of tokens in the input stream, because at each step we look at the next token and decide exactly which rule to apply to the current nonterminal, and we never return again (no backtracking) to this nonterminal in the tree.
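This O(n), no-backtracking strategy is exactly what a recursive descent parser does: one procedure per nonterminal, each choosing its rule by inspecting the single lookahead token.  Here is a minimal sketch of such a parser for grammar 3, assuming the token stream arrives as a list of strings with illustrative spellings like "id" and "intlit".

```python
class ParseError(Exception):
    pass

class Parser:
    """Recursive descent parser for grammar 3: one method per nonterminal,
    one token of lookahead, no backtracking."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def look(self):
        # The current lookahead token, or None at end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, tok):
        # Hang the lookahead token on the tree and advance past it.
        if self.look() != tok:
            raise ParseError(f"expected {tok!r}, saw {self.look()!r}")
        self.pos += 1

    def expression(self):            # rule 1
        self.term()
        self.expression_tail()

    def expression_tail(self):       # rules 2-4
        if self.look() in ("+", "-"):
            self.match(self.look())
            self.term()
            self.expression_tail()
        # otherwise rule 4 (epsilon): emit nothing

    def term(self):                  # rule 5
        self.factor()
        self.term_tail()

    def term_tail(self):             # rules 6-8
        if self.look() in ("*", "/"):
            self.match(self.look())
            self.factor()
            self.term_tail()

    def factor(self):                # rule 9
        self.primary()
        self.factor_tail()

    def factor_tail(self):           # rules 10-11
        if self.look() == "^":
            self.match("^")
            self.primary()
            self.factor_tail()

    def primary(self):               # rules 12-14
        if self.look() in ("id", "intlit"):
            self.match(self.look())
        elif self.look() == "(":
            self.match("(")
            self.expression()
            self.match(")")
        else:
            raise ParseError(f"unexpected token {self.look()!r}")

def parses(tokens):
    """True if the token list is a valid <expression> that uses all input."""
    p = Parser(tokens)
    try:
        p.expression()
        return p.pos == len(p.tokens)
    except ParseError:
        return False
```

For example, parses(["id", "*", "intlit"]) succeeds, while parses(["id", "*", "*"]) fails as soon as the second * arrives, with no backtracking attempted.  Each token is examined a bounded number of times, which is where the O(n) bound comes from.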

Fortunately for us, the theorists have already answered these questions for us.  To do so, they first define these kinds of grammars and give them a name. 

 

Definition

A context free grammar G is an LL(1) grammar if and only if for any two leftmost derivations

S ⟹*lm wAα ⟹lm wβα ⟹*lm wx

S ⟹*lm wAα ⟹lm wγα ⟹*lm wz

(where ⟹lm denotes a single leftmost derivation step and ⟹*lm denotes zero or more such steps)

in the grammar, First1(x) = First1(z) implies that β = γ. 

This definition is a precise shorthand for what we intend when we say we want a grammar that is efficiently parsable in a top down fashion.  First let's establish some notation.  The symbol w stands for the string of tokens that has already been matched by the parse so far; for example, w might be

mp_program identifier semicolon var identifier colon mp_integer semicolon mp_begin

indicating that these tokens have already been covered in the leftmost derivation so far.  The symbols x and z are just the remaining tokens that follow w in the two different programs.

Given this notation, the definition just shows two different parses using the same grammar, where in both cases the same nonterminal A is about to be expanded.  In both cases, the first k tokens of x are the same as the first k tokens of z (the definition above is the k = 1 case, which is why the function First1 appears).  In other words, suppose a person is trying to decide which rule to apply to A in either case, and that person only gets to look at the next k tokens and finds them the same in both cases.  If that always means that there is only one rule that can be applied, then the grammar is LL(k) and can be efficiently parsed in a top down fashion.

We will be seeing examples of this as we go on.
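One of those results is the FIRST-set construction, which mechanizes the "analysis of the grammar" appealed to several times above.  A minimal Python sketch for grammar 3 follows; the token spellings are assumptions, epsilon is represented by the empty string "", and FOLLOW sets (needed to fully decide the epsilon rules) are omitted to keep the sketch short.

```python
# Grammar 3 as a rule table (keys in angle brackets are nonterminals;
# an empty right-hand side [] is an epsilon rule).
GRAMMAR = {
    "<e>":   [["<t>", "<e_t>"]],
    "<e_t>": [["+", "<t>", "<e_t>"], ["-", "<t>", "<e_t>"], []],
    "<t>":   [["<f>", "<t_t>"]],
    "<t_t>": [["*", "<f>", "<t_t>"], ["/", "<f>", "<t_t>"], []],
    "<f>":   [["<p>", "<f_t>"]],
    "<f_t>": [["^", "<p>", "<f_t>"], []],
    "<p>":   [["id"], ["intlit"], ["(", "<e>", ")"]],
}

def first_of_seq(seq, first):
    """FIRST of a sequence of symbols; "" marks that the whole
    sequence can derive epsilon."""
    result = set()
    for sym in seq:
        if sym not in GRAMMAR:            # sym is a token
            result.add(sym)
            return result
        result |= (first[sym] - {""})
        if "" not in first[sym]:          # sym cannot vanish; stop here
            return result
    result.add("")                        # every symbol can vanish
    return result

def compute_first():
    """Standard fixpoint iteration: grow FIRST sets until nothing changes."""
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, alts in GRAMMAR.items():
            for alt in alts:
                f = first_of_seq(alt, first)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
    return first

def first_sets_disjoint(first):
    """Necessary LL(1) condition: for each nonterminal, no token may
    select more than one alternative."""
    for nt, alts in GRAMMAR.items():
        seen = set()
        for alt in alts:
            f = first_of_seq(alt, first) - {""}
            if f & seen:
                return False
            seen |= f
    return True
```

Running compute_first() on grammar 3 gives, for example, FIRST(<p>) = {id, intlit, (}, and first_sets_disjoint confirms that no lookahead token ever selects two different rules, which is the property the two-person thought experiment relied on.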

For now, just realize that the theoreticians have looked at all aspects of LL(k) grammars to give us results that we can apply to practical parsing situations.  For example, here are two theorems.

Theorem 1.

No ambiguous context free grammar is LL(k) for any k.

Theorem 2.

No left recursive context free grammar is LL(k) for any k.

Such theorems are useful to us, because we can easily examine some grammars and realize that they cannot be used for efficient top down parsing, either because they are ambiguous (hence not good for any programming language grammar anyway) or because they are left recursive.
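Theorem 2 is the reason grammars like grammar 3 are written with the <..._tail> nonterminals: a left recursive rule set such as <e> ⟶ <e> + <t> | <t> must first be transformed to remove the left recursion.  Here is a sketch of the standard transformation for immediate left recursion (A ⟶ Aα | β becomes A ⟶ βA', A' ⟶ αA' | ε); the naming convention for the new tail nonterminal is an assumption of this sketch.

```python
def eliminate_left_recursion(nt, alts):
    """Remove immediate left recursion from one nonterminal's alternatives.

    A -> A a1 | ... | A am | b1 | ... | bn
    becomes
    A  -> b1 A' | ... | bn A'
    A' -> a1 A' | ... | am A' | epsilon
    """
    tail = nt[:-1] + "_tail>"                 # e.g. "<e>" -> "<e_tail>"
    recursive = [alt[1:] for alt in alts if alt and alt[0] == nt]
    others = [alt for alt in alts if not (alt and alt[0] == nt)]
    if not recursive:
        return {nt: alts}                     # nothing to do
    return {
        nt: [alt + [tail] for alt in others],
        tail: [alt + [tail] for alt in recursive] + [[]],  # [] is epsilon
    }
```

Applying this to <e> ⟶ <e> + <t> | <e> - <t> | <t> yields exactly the shape of rules 1 through 4 of grammar 3, which is how a left recursive expression grammar becomes a candidate for LL(1) parsing.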