Bottom-Up, LR Parsing

Overview of LR parsing

We can note the following about LR parsing.

LR parsing is deterministic. That is, it can be done with one token of lookahead and no backtracking
No ambiguous grammar can be LR(k) for any k.
LR grammars represent the largest class of context free grammars that can be parsed deterministically (i.e. with no backtracking ever required).
The languages that have LR grammars are a proper superset of the languages that have LL grammars.
Every LL grammar is an LR grammar as well (however, LL grammars are not generally of the form we would choose to use for an LR parser)
Whereas no grammar that is left recursive can be LL(k) for any k, left recursion does not preclude a grammar from being LR.
If a context free grammar is LR(k) for some k, there is an LR(1) grammar that generates the same language.

We have looked off and on at three different grammars for generating the sub-language of all arithmetic expressions in some fictional programming language. Here they are again.

Grammar 1:

<expression> ⟶ <expression> - <expression>
 <expression> ⟶ <expression> + <expression>
 <expression> ⟶ <expression> * <expression>
 <expression> ⟶ <expression> / <expression>
 <expression> ⟶ <expression> ^ <expression>
 <expression> ⟶ identifier
 <expression> ⟶ integer_literal
 <expression> ⟶ ( <expression> )

We recognize now that this grammar is ambiguous, becuase for most strings in the language it allows multiple parse trees. Hence it is neither LL(k) for any k nor LR(k) for any k. This should be pretty obvious. If a string allows multiple parse trees, we would not be able to decide at some points during a parse of that string how to continue building a parse tree deterministically in a bottom up fashion, because there would be more than one choice for the next step in constructing the parse tre that would lead to a correct (but different) parse tree. So, we ignore this grammar for LR parsing as well.

Grammar 2:

 <expression> ⟶ <expression> + <term>
 <expression> ⟶ <expression> - <term>
 <expression> ⟶ <term>
 <term> ⟶ <term> * <factor>
 <term> ⟶ <term> / <factor>
 <term> ⟶ <factor>
 <factor> ⟶ <factor> ^ <primary>
 <factor> ⟶ <primary>
 <primary> ⟶ identifier
 <primary> ⟶ integer_literal
 <primary> ⟶ ( <expression> )

This grammar is not LL(k) for any k, because it is left recursive. However it is not ambiguous. It is also an LR(1) grammar. That is, we can parse strings in the language of this grammar in a bottom up fashion using just one token of lookahead.

Grammar 3:

 <expression> ⟶ <term> <expression_tail>
 <expression_tail> ⟶ + <term> <expression_tail>
 <expression_tail> ⟶ - <term> <expression_tail>
 <expression_tail> ⟶ ε
 <term> ⟶ <factor> <term_tail>
 <term_tail> ⟶ * <factor> <term_tail>
 <term_tail> ⟶ / <factor> <term_tail>
 <term_tail> ⟶ ε
 <factor> ⟶ <primary> <factor_tail>
 <factor_tail> ⟶ ^ <primary> <factor_tail>
 <factor_tail> ⟶ ε
 <primary> ⟶ identifier
 <primary> ⟶ integer_literal
 <primary> ⟶ ( <expression> )

Finally, we recognize this grammar as the LL(1) grammar for arithmetic expressions that we have often used as an example. It, too is LR(1). However, it is not in a convenient form for LR parsing, so we use grammar 2.

An example of LR parsing. In class we attempted to parse the string

identifier + identifier * integer_literal

bottom up intuitively. Here is what we found

We start with the tokens and work our way back up to the start symbol as we parse.
At each step we look at what we have parsed so far to see if we can find a right hand side.
When we find a right hand side we can reduce it to its left hand side as we work our way up the tree.
The issue is that we want to use the current lookahead symbol to help decide when it is time to reduce the right hand side of some rule to its left hand side. At any point we may see more than one right hand side of a rule, or we may find none. The lookahead symbol should be able to guide this process.

The basic steps in the process are these:

Using the current lookahead symbol, decide whether to shift the current lookahead symbol to the next leaf of the parse tree being constructed and move the file pointer to the next token to be the new lookahead symbol.
Reduce a right hand side found in the parse tree under construction to its left hand side.

We learned that we could do this for the string

identifier + identifier * integer_literal

as we worked our way through the parse tree bottom up.