Efficient, Deterministic Parsing
Objective
The objective of parsing is to determine whether a given token input file represents a valid program in the programming language being compiled.
How would you do this if you were not given any directions? Think about this for a while, and you will see how much we owe to the theoretical underpinnings of formal languages and compiling done by others.
Means
A way to meet this objective is to construct a parse tree for the input program using the rules of the grammar for the programming language. There are two primary ways to build such a tree:
- Top down: Start with the start symbol of the grammar as the root of the tree and expand the tree a step at a time. At each step, examine the leaves of the current tree from left to right, and always expand the leftmost nonterminal in the tree by applying a grammar rule to it (this approach is also referred to as a leftmost derivation). When all the leaves are tokens, and the tokens match the tokens of the input file from left to right, the tree represents a successful parse of the program. If at some point there is no way to proceed with the parse, or if the leaves of the final tree do not match the input token file, the tree does not represent a successful parse of the input token file.
- Bottom up: Start with the file of input tokens as the leaves of a potential parse tree. Contract the tree a step at a time. At each step, look at the current state of the upper level nodes in the tree (the nodes that have no parent). Find a right hand side of some rule in the grammar in these nodes, and join them to a new parent node that is the nonterminal on the left hand side of the rule. If the point is reached where the only upper level node in the tree is the start symbol for the grammar, the tree represents a successful parse of the input token stream. Otherwise, the tree cannot be finished and does not represent a successful parse.
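The bottom-up procedure just described can be sketched very naively: keep scanning the parentless nodes for some rule's right hand side and join those nodes under the rule's left hand side. The sketch below is a Python illustration with a hypothetical toy grammar; its greedy leftmost-reduction strategy is my own simplification and is not how practical bottom-up parsers choose reductions.

```python
# Naive bottom-up recognizer: repeatedly replace the leftmost occurrence of
# any rule's right hand side among the parentless nodes with the rule's
# left hand side, until only the start symbol E remains.
# Toy grammar (hypothetical): E -> E + T | T,   T -> id

RULES = [("E", ["E", "+", "T"]), ("E", ["T"]), ("T", ["id"])]

def reduce_once(nodes):
    """Return nodes with one reduction applied, or None if no rule matches."""
    for lhs, rhs in RULES:
        n = len(rhs)
        for i in range(len(nodes) - n + 1):
            if nodes[i:i + n] == rhs:
                return nodes[:i] + [lhs] + nodes[i + n:]
    return None

def bottom_up_parse(tokens):
    nodes = list(tokens)
    while nodes != ["E"]:
        nodes = reduce_once(nodes)
        if nodes is None:
            return False   # stuck: the tree cannot be finished
    return True            # only the start symbol remains: successful parse

print(bottom_up_parse(["id", "+", "id"]))  # True
print(bottom_up_parse(["id", "id"]))       # False
```

Note that this greedy strategy can get stuck on valid inputs for richer grammars; practical bottom-up parsers decide deterministically which reduction to make, for exactly the efficiency reasons discussed next.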
Grammar Requirements for Practical Parsing
- No ambiguity: Ambiguous grammars are out of the question, because they allow two or more parse trees for some strings in the language. This would mean that there would be two different ways to parse some programs, which in turn would result in two or more possible different translations of those programs, each with different meanings (different outputs).
- O(n) time parsing: The run time of the parser should be no worse than linear; otherwise, the time to compile ever larger programs would be prohibitive. This requirement has implications for the parser.
- No backtracking: If we allowed a parser to try a series of rules in building a parse as a guess, then we would have to allow the parser to remove that series of rules and start with a different series. This would lead to worse than O(n) time complexity.
- Constant lookahead range: If, in order to determine which rule to apply at each step, we had to examine an arbitrary number of tokens (e.g., half of the remaining tokens), we would again be approaching O(n²) time for the parse. So, we will have to fix the number of tokens the parser is allowed to use as lookahead for determining which rule to apply at any step.
- Enforce operator precedence in expressions: Arithmetic, Boolean, and other (e.g., string) expressions involve operators that normally have different precedence levels. These precedence levels should be enforced by the grammar.
Practical Top Down Parsing
Given the above requirements, we can express what property we need a programming language grammar to have if we intend to parse programs in a top down fashion. It is this:
- Unique rule identification given just the next token from the scanner that hasn't yet been attached to the parse tree. That is, when we are about to expand the leftmost nonterminal in the parse tree we are constructing from the top down, just one token from the scanner is all that is needed to determine which rule to use for expanding that leftmost nonterminal (or whether there is no rule to apply, so the parse cannot proceed). We do not want to look arbitrarily far through the tokens to try to figure out which rule we should apply at any point.
The Big Question
The big question is this: Are we asking too much? It would be nice if we could build a top-down parser with all of the above properties that could construct a parse tree for any input program (or recognize at some point during the parse that no parse tree exists for that purported program). Are there such grammars for real programming languages, or are the restrictions (no ambiguity, no backtracking, and one lookahead token that always predicts which, if any, rule applies to the leftmost nonterminal in a top down parse) just so restrictive that no real programming language could have such a grammar?
The answers to these questions are, "No, we are not asking too much," and, "Yes, there are context free grammars for most programming languages that have these kinds of properties (with a few minor exceptions)." How do we know? Well, once again we have to doff our virtual hats to the theoreticians who explored and developed the theory of grammars with these properties. We will lean heavily on the theory they developed.
Let's look at an example.
An Intuitive View
Let's see how top-down, deterministic parsing with one token of lookahead might go, given the proper grammar. We will use grammar 3 from last time.
1. <expression> ⟶ <term> <expression_tail>
2. <expression_tail> ⟶ + <term> <expression_tail>
3. <expression_tail> ⟶ - <term> <expression_tail>
4. <expression_tail> ⟶ ε
5. <term> ⟶ <factor> <term_tail>
6. <term_tail> ⟶ * <factor> <term_tail>
7. <term_tail> ⟶ / <factor> <term_tail>
8. <term_tail> ⟶ ε
9. <factor> ⟶ <primary> <factor_tail>
10. <factor_tail> ⟶ ^ <primary> <factor_tail>
11. <factor_tail> ⟶ ε
12. <primary> ⟶ identifier
13. <primary> ⟶ integer_literal
14. <primary> ⟶ ( <expression> )
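To preview where this is heading: for this grammar, the unique rule choice at every (nonterminal, lookahead) point can be tabulated in advance. Below is a hand-worked sketch of that table in Python, using shortened nonterminal names (E, Et, T, Tt, F, Ft, P) and the token names id, intlit, and the operator and parenthesis symbols, with $ marking end of input; take the specific entries as my own worked illustration of the idea rather than something the text has derived yet.

```python
# LL(1) parse table for the expression grammar, worked out by hand.
# Keys are (nonterminal, lookahead token); values are rule numbers.
# E = <expression>, Et = <expression_tail>, T = <term>, Tt = <term_tail>,
# F = <factor>, Ft = <factor_tail>, P = <primary>; '$' is end of input.
TABLE = {
    ("E", "id"): 1, ("E", "intlit"): 1, ("E", "("): 1,
    ("Et", "+"): 2, ("Et", "-"): 3, ("Et", ")"): 4, ("Et", "$"): 4,
    ("T", "id"): 5, ("T", "intlit"): 5, ("T", "("): 5,
    ("Tt", "*"): 6, ("Tt", "/"): 7,
    ("Tt", "+"): 8, ("Tt", "-"): 8, ("Tt", ")"): 8, ("Tt", "$"): 8,
    ("F", "id"): 9, ("F", "intlit"): 9, ("F", "("): 9,
    ("Ft", "^"): 10,
    ("Ft", "*"): 11, ("Ft", "/"): 11, ("Ft", "+"): 11,
    ("Ft", "-"): 11, ("Ft", ")"): 11, ("Ft", "$"): 11,
    ("P", "id"): 12, ("P", "intlit"): 13, ("P", "("): 14,
}

# Every (nonterminal, lookahead) pair selects at most one rule; a missing
# entry means a syntax error. That is exactly the determinism we need.
print(TABLE[("P", "id")])   # 12: <primary> -> identifier
print(TABLE[("Tt", "*")])   # 6: <term_tail> -> * <factor> <term_tail>
```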
Suppose that two different people are parsing a string using this grammar, given only one token of lookahead at any point. Both start with the start symbol, <expression> (for convenience, we will abbreviate each nonterminal to the first letters of the words in its name, e.g., <e> for <expression> and <e_t> for <expression_tail>).
Suppose that both people are given the lookahead token id as the first token of the string they are to parse. The first rule they both must apply is rule 1.
Since id is the first token, it will naturally have to appear as the first leaf of the completed parse tree if the parse is to be successful. You might think that both people used the same rule (1) because it is the only rule with <expression> on the left. It turns out that we can actually determine at this point whether it is possible to even use this one rule. An analysis of the grammar shows that applying this rule can indeed eventually lead to an id as the first token.
The lookahead token has not changed. It is still id. We only get a new lookahead token when the one we have now is attached to the parse tree (and therefore we are no longer looking ahead to it). So both people expand their trees in the same way again: the leftmost nonterminal, <t>, is expanded as usual.
Here, rule 5 is used, and it really is the only choice both have, since it is the only rule with <t> on the left.
The next rule must be applied to <f>. Again both use rule 9.
Finally, both people are now about to expand <p>, the leftmost nonterminal. Both have the same lookahead token (id). Both see three different rules that could be applied, namely 12, 13, and 14; those are all the rules with <p> on the left hand side. But given that the first token in the file is id, it is clear that only rule 12 applies. If rule 13 were used, we would be hanging an integer literal leaf on the tree, which doesn't match the first token of the input file (the id). Similarly, if rule 14 were used, a left parenthesis, (, would be placed on the tree in this position, which also would not match the id in the token file.
The important point to notice is that even though two different people are parsing two possibly different input token files, they have been forced to use the same rules at each step so far, based on the fact that the lookahead was the same for both throughout. There were no other choices.
At this point we have hung the id token on the tree where it belongs. To continue parsing, we need to get the next token from the token file as our lookahead token. Suppose it is the multiplication operator *. With this lookahead, both people need to decide how to expand the leftmost unexpanded nonterminal, which in this case is <f_t>. Notice that there are two rules with <f_t> on the left:
10. <factor_tail> ⟶ ^ <primary> <factor_tail>
11. <factor_tail> ⟶ ε
Both people will notice that rule 10 won't work, because it will hang the token ^ on the tree when the lookahead token implies that the next token to hang on the tree is the *. So, both people are left with rule 11 (we will see later that we can determine right now with this grammar whether using rule 11 will ever lead to a * being attached to the tree next).
Once again, both people who are doing this parse without knowing what the other is doing are led to use the same rule. There is no backtracking, because in every instance, only one rule has worked.
So, now the nonterminal to expand is <t_t>. The lookahead token is still * because it has not yet been hung on the tree. Look at the three rules with <t_t> on the left:
6. <term_tail> ⟶ * <factor> <term_tail>
7. <term_tail> ⟶ / <factor> <term_tail>
8. <term_tail> ⟶ ε
Using rule 6 will mean that the lookahead token, *, will be attached to the tree at this point, which is what we desire. Application of rule 7 would cause a problem, because a / would be hung on the tree where a * was called for. We can't easily see what would happen if we applied rule 8 (it could be that a * would be arrived at by some later expansions). Without going into details at this point, an analysis of the grammar shows that if we were to use rule 8, there is no way that a * could be hung as the next token on the tree.
So, here we are again. Both people must apply rule 6 to continue the parse.
Now notice something different. Suppose that at this point person 1 were given a lookahead token of id, whereas person 2 were given a lookahead token of intlit. Both would apply rule 9 to <f>:
9. <factor> ⟶ <primary> <factor_tail>
At this point, though, person 1 would apply rule 12 to <primary>
12. <primary> ⟶ identifier
because he has a lookahead token of id. Person 2 would apply rule 13 to <primary> in her tree,
13. <primary> ⟶ integer_literal
because her lookahead token is integer_literal.
The important thing to notice is that if two parses are being done independently, and both have a given nonterminal to expand, and both have the same lookahead token, in both parses the same rule has to be applied. There are no other choices.
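That observation is exactly what makes a deterministic, table-driven parser possible. The sketch below (Python, with the same shortened nonterminal names, and table entries worked out by hand, so treat it as an illustration rather than a production parser) keeps the unexpanded frontier of the tree on a stack and, at each step, uses one lookup on (top of stack, lookahead token) to pick the unique rule.

```python
# Table-driven LL(1) parser sketch for the expression grammar.
# RHS maps rule numbers to right hand sides; TABLE maps
# (nonterminal, lookahead) to the single applicable rule number.
RHS = {
    1: ["T", "Et"], 2: ["+", "T", "Et"], 3: ["-", "T", "Et"], 4: [],
    5: ["F", "Tt"], 6: ["*", "F", "Tt"], 7: ["/", "F", "Tt"], 8: [],
    9: ["P", "Ft"], 10: ["^", "P", "Ft"], 11: [],
    12: ["id"], 13: ["intlit"], 14: ["(", "E", ")"],
}
TABLE = {
    ("E", "id"): 1, ("E", "intlit"): 1, ("E", "("): 1,
    ("Et", "+"): 2, ("Et", "-"): 3, ("Et", ")"): 4, ("Et", "$"): 4,
    ("T", "id"): 5, ("T", "intlit"): 5, ("T", "("): 5,
    ("Tt", "*"): 6, ("Tt", "/"): 7,
    ("Tt", "+"): 8, ("Tt", "-"): 8, ("Tt", ")"): 8, ("Tt", "$"): 8,
    ("F", "id"): 9, ("F", "intlit"): 9, ("F", "("): 9,
    ("Ft", "^"): 10,
    ("Ft", "*"): 11, ("Ft", "/"): 11, ("Ft", "+"): 11,
    ("Ft", "-"): 11, ("Ft", ")"): 11, ("Ft", "$"): 11,
    ("P", "id"): 12, ("P", "intlit"): 13, ("P", "("): 14,
}
NONTERMINALS = {"E", "Et", "T", "Tt", "F", "Ft", "P"}

def parse(tokens):
    tokens = tokens + ["$"]        # '$' marks end of input
    stack = ["$", "E"]             # frontier of unexpanded leaves; start symbol on top
    i = 0                          # index of the lookahead token
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top in NONTERMINALS:
            rule = TABLE.get((top, look))
            if rule is None:
                return False       # no rule applies: syntax error
            stack.extend(reversed(RHS[rule]))  # leftmost symbol ends up on top
        else:
            if top != look:
                return False       # leaf token does not match the input
            i += 1                 # token attached; fetch new lookahead
    return i == len(tokens)

print(parse(["id", "*", "intlit"]))            # True
print(parse(["(", "id", "+", "intlit", ")"]))  # True
print(parse(["id", "+"]))                      # False
```

Each input token is consumed exactly once and each step is a constant-time table lookup, which is where the O(n) behavior discussed below comes from.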
The Abstraction
Given the above considerations for practical parsing in general, what would we want in a practical grammar for top down parsing? Basically, what we need is a grammar that satisfies the following:
- At each point in the top down parse where we are about to expand a nonterminal, we must be able to know exactly which rule to apply to that nonterminal by looking only at the next few tokens in the token file. If no rule applies, then we know that there is an error in the program.
We can envision this with a general parse tree at some point in a top down, leftmost parse: the tokens of w have already been attached as the leftmost leaves, A is the leftmost nonterminal that is now to be expanded, and the Greek α stands for the rest of the leaves to the right of A, some of them tokens and others nonterminals.
Given a tree like this and given the next token beyond w, the question is which rule we should use to expand A. If we can determine which rule to use at every point, then we say that the grammar is an LL grammar.
If we have such an LL grammar, then we can parse the string in O(n) time, where n is the number of tokens in the input stream, because at each step we look at the next token and decide exactly which rule to apply to the current nonterminal, and we never return again (no backtracking) to this nonterminal in the tree.
Fortunately, the theorists have already answered these questions for us. To do so, they first defined these kinds of grammars and gave them a name.
Definition
A context free grammar G is an LL(1) grammar if and only if for any two leftmost derivations
        *          1          *
    S ===> wAα ===> wβα ===> wx
       lm         lm         lm

        *          1          *
    S ===> wAα ===> wγα ===> wz
       lm         lm         lm
in the grammar, First1(x) = First1(z) implies that β = γ.
This definition is a precise shorthand for what we intend when we say we want a grammar that is efficiently parsable in a top down fashion. First let's establish some notation:
- Capital letters at the beginning of the alphabet stand for single nonterminals. Thus, in the above definition, A is a single nonterminal. In a real programming language grammar, A might be <expression>, for example.
- The start symbol of a grammar is S. In a real programming language grammar, S might be <program>.
- Small letters at the beginning of the alphabet are single terminals (tokens). There are none in the definition, but identifier might be an example of a token in a real programming language.
- Small letters at the end of the alphabet stand for strings of tokens. So, w, x, and z are all strings of tokens in the definition. Any one of them may be ε (the empty string), or it might consist of one or more tokens. For example, w could be ε. On the other hand, w could be
    mp_program identifier semicolon var identifier colon mp_integer semicolon mp_begin

indicating that these tokens have already been covered in the leftmost derivation so far. The symbols x and z are just the remaining tokens that follow w in the two different programs.
- The * over an arrow in the derivation means that 0 or more rules have been applied, and the 1 means that exactly one rule has been applied. The lm means that the rules were applied in a leftmost fashion (always to the leftmost nonterminal at the current stage of the derivation).
Given this notation, the definition just shows two different parses using the same grammar, where in both cases the same nonterminal A is about to be expanded and the same tokens w have already been attached. The condition First1(x) = First1(z) says that the first token of the remaining input x is the same as the first token of the remaining input z. In other words, suppose a person is trying to decide which rule to apply to A in either case, and that person only gets to look at the next token and finds it the same in both cases. If that always means that there is only one rule that can be applied (β = γ), then the grammar is LL(1) and can be efficiently parsed in a top down fashion. The general LL(k) definition is the same, except that Firstk compares the next k tokens.
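The First1 sets the definition relies on can be computed mechanically from the grammar. The sketch below (Python) does so for the expression grammar used earlier, again with shortened names; the Follow sets needed for the ε-rules are written out by hand rather than computed, so treat this as an illustrative check rather than a complete algorithm. It verifies the LL(1) condition by confirming that the rules for each nonterminal predict disjoint token sets.

```python
# First1 computation by fixpoint iteration for the expression grammar.
# "" stands for ε. Follow sets for the ε-rules are precomputed by hand.
GRAMMAR = {
    "E":  [["T", "Et"]],
    "Et": [["+", "T", "Et"], ["-", "T", "Et"], []],
    "T":  [["F", "Tt"]],
    "Tt": [["*", "F", "Tt"], ["/", "F", "Tt"], []],
    "F":  [["P", "Ft"]],
    "Ft": [["^", "P", "Ft"], []],
    "P":  [["id"], ["intlit"], ["(", "E", ")"]],
}
FOLLOW = {"Et": {")", "$"}, "Tt": {"+", "-", ")", "$"},
          "Ft": {"*", "/", "+", "-", ")", "$"}}

def first_of_seq(seq, first):
    """First1 of a sequence of grammar symbols ("" marks ε)."""
    out = set()
    for sym in seq:
        if sym in GRAMMAR:
            out |= first[sym] - {""}
            if "" not in first[sym]:
                return out
        else:                 # terminal: it is the first token, stop here
            out.add(sym)
            return out
    out.add("")               # every symbol could vanish: sequence can be ε
    return out

def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:            # iterate until no set grows
        changed = False
        for nt, rules in GRAMMAR.items():
            for rhs in rules:
                f = first_of_seq(rhs, first)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
    return first

first = compute_first()

def predict(nt, rhs):
    """Tokens selecting rule nt -> rhs: First1(rhs), plus Follow(nt) if rhs can be ε."""
    f = first_of_seq(rhs, first)
    return (f - {""}) | (FOLLOW.get(nt, set()) if "" in f else set())

# LL(1) condition: for each nonterminal, the predict sets of its rules are disjoint.
for nt, rules in GRAMMAR.items():
    sets = [predict(nt, r) for r in rules]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            assert not (sets[i] & sets[j]), f"conflict at {nt}"

print(sorted(first["E"]))   # ['(', 'id', 'intlit']
```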
We will be seeing examples of this as we go on.
For now, just realize that the theoreticians have looked at all aspects of LL(k) grammars to give us results that we can apply to practical parsing situations. For example, here are two theorems.
Theorem 1.
No ambiguous context free grammar is LL(k) for any k.
Theorem 2.
No left recursive context free grammar is LL(k) for any k.
Such theorems are useful to us, because we can easily examine some grammars and realize that they cannot be used for efficient top down parsing because they are either ambiguous (hence not good for any programming language grammar) or they are left recursive.
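Theorem 2 is easy to see in a small example. In the hypothetical left-recursive grammar E ⟶ E + T | T (with T ⟶ id), every string derivable from either E-rule begins with id, so First1 cannot distinguish the two rules. The Python sketch below makes that overlap explicit; the cycle guard is my own device for terminating the recursion despite the left recursion.

```python
# First1 for a left-recursive toy grammar, showing the LL(1) conflict.
# Grammar: E -> E + T | T,   T -> id
GRAMMAR = {"E": [["E", "+", "T"], ["T"]], "T": [["id"]]}

def first(sym, seen=frozenset()):
    """First1 of a symbol; 'seen' breaks the left-recursive cycle."""
    if sym not in GRAMMAR:          # terminal
        return {sym}
    if sym in seen:                 # revisiting E inside E: nothing new
        return set()
    out = set()
    for rhs in GRAMMAR[sym]:
        out |= first(rhs[0], seen | {sym})   # no ε-rules here, so rhs[0] suffices
    return out

first_rule1 = first(GRAMMAR["E"][0][0])  # E -> E + T begins with First1(E)
first_rule2 = first(GRAMMAR["E"][1][0])  # E -> T begins with First1(T)
print(first_rule1 & first_rule2)         # {'id'}: both rules predict id, so not LL(1)
```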