Parsing - An Intuitive Overview
Parsing
Definition
Parsing is the activity of determining whether an input string (e.g., a program, an English sentence, etc.) has the proper syntax (form).

Natural Language Examples
Consider the following string.

Joan ate the ball.

All of us recognize the individual words fine (that is, we scan this string just fine). We also parse it fine; that is, we recognize that it has the right form for a sentence in English. We might wonder why Joan ate the ball, but we are not at all confused by the form of the sentence. We could translate this sentence into another language, say German, and capture the same meaning:
Joan hat den Ball gegessen.

On the other hand, consider the next string.
The ate Joan ball.

In this case, we still scan the string fine. We recognize that the individual lexemes (words) represent valid tokens, for example, definite article, verb, noun, noun, but we can't parse this string of words. That is, we recognize immediately that there is something wrong with the form of this string, that it is not a sentence. And since there is something wrong with its form, we can't make sense of it (that is, we can't figure out its semantics, what it means). Are there words missing? Did the ball eat Joan ("Bad ball!")? Do we care any more? We certainly wouldn't try to translate it into another language, because we can't capture its meaning in English, so why try in another language? Another example of a string that scans fine, but does not parse, is
ball the the a funny merrily a a a barf.

To help see another twist on the situation, consider the following string:
Curious green ideas sleep furiously.

In this case we both scan and parse the string properly. It "feels" right. The adjectives, nouns, adverbs, and verbs all seem to be in the proper locations. However, the meaning eludes us. Still, because it parses properly (has the right form), we could probably translate it into another language (and the people of that language could puzzle over it).
Extension to Formal Languages
Formal languages are languages that have a strict, formal definition. Programming languages are that way. Natural languages are not. Not only do they have ambiguities that are difficult to formalize, they also are dynamic and change constantly. Following the examples from English above, we can make the following observations about programming languages:

- A program may scan fine but not parse (not have the correct form).
- If a program does not parse (does not have the right form) there is no reason to translate it into the target language, because we can't be sure of its meaning in order to make a proper translation.
- If a program does parse (does have the right form), we always translate it into the target language. The intentions of the programmer are the programmer's responsibility, not the compiler's. Indeed, it is impossible for the compiler to determine what the programmer meant in any case.
The Process of Parsing
How does one go about parsing an input string? That is, how does one determine whether an input string has the correct form? To make this possible, there must be a standard set of rules describing the syntax of strings in the language. These rules are generally given in the form of a grammar. Parsing, then, is the process of determining whether a particular string has been constructed according to the rules of the grammar. Recall from your study of computer science theory what a grammar is.

Definition

A grammar G = (N, T, P, S) is a formal definition of a language in which
- N is a finite set of nonterminal symbols
- T is a finite set of terminal symbols (which we call tokens in compiler theory)
- P is a set of grammar rules (sometimes called productions)
- S is a special nonterminal symbol called the start symbol.

So, for an English grammar, N, the set of nonterminals, would look something like

N = {<sentence>, <subject>, <noun phrase>, <verb phrase>, <direct object>, <indirect object>, ...}

T, the set of terminals, would look something like

T = {definite_article, indefinite_article, noun, verb, adjective, adverb, comma, period, ...}

The set of lexemes would be all the words in a standard dictionary as well as punctuation. In your study of formal languages, you did not assign lexemes to tokens, because that was not necessary for the abstract study of grammars and languages. The set P of grammar rules would have rules of the form

<sentence> ⟶ <noun phrase> <verb phrase> .
<noun phrase> ⟶ noun
<noun phrase> ⟶ definite_article noun
<noun phrase> ⟶ adjective <noun phrase>
<verb phrase> ⟶ verb <direct object>
<verb phrase> ⟶ verb <indirect object>
...

The start symbol for this grammar is

<sentence>

To see whether a string such as "Joan ate the ball." is syntactically correct, one must see whether it follows the rules of the grammar. This can be done as follows:
- Try to construct a parse tree with the start symbol as the root.
- Continuously apply rules to the nonterminals in the tree in an attempt to recreate the input string.
- If a tree can be produced in which all of its leaves are terminals and, when read left to right, the lexemes corresponding to the terminals match the input string, the string has been successfully parsed.
- If the parse tree cannot be completed, the string is not valid.
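As a concrete illustration of these steps, here is a leftmost derivation of "Joan ate the ball." (the expansion of <direct object> to definite_article noun is one of the rules elided by the "..." in the rule list above):

```
<sentence>
⟹ <noun phrase> <verb phrase> .
⟹ noun <verb phrase> .                        (Joan)
⟹ noun verb <direct object> .                 (ate)
⟹ noun verb definite_article noun .           (the ball)
```

Reading the terminals left to right and matching their lexemes against the input reproduces the original string, so the parse succeeds.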
For "Joan ate the ball." the completed parse tree is

```
<sentence>
├── <noun phrase>
│   └── noun (Joan)
├── <verb phrase>
│   ├── verb (ate)
│   └── <direct object>
│       ├── definite_article (the)
│       └── noun (ball)
└── period (.)
```

where the strings in parentheses are the lexemes corresponding to the tokens. Basically, in this instance, the scanner would scan for individual words, then look them up in a dictionary to see whether they were nouns, verbs, articles, or just what. Of course, in English, some words fit more than one category (the word "running" could be a lexeme that matches a noun, verb, or adjective token), which is one of the many reasons that English is so difficult to define formally with a grammar.
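The dictionary-lookup scanning and top-down parsing just described can be sketched in code. Everything here is illustrative: the word list, the token names, and the assumption that a <direct object> is parsed like a <noun phrase> are simplifications for this example, not a real grammar of English.

```python
# Sketch of scanning (dictionary lookup) and recursive-descent parsing
# for the tiny English grammar above. Assumptions: a small fixed word
# list, and <direct object> parsed as a <noun phrase>.

WORD_TOKENS = {
    "joan": "noun", "ball": "noun",
    "ate": "verb",
    "the": "definite_article",
    "green": "adjective",
    ".": "period",
}

def scan(text):
    """Split the input into (token, lexeme) pairs via dictionary lookup."""
    words = text.replace(".", " .").lower().split()
    return [(WORD_TOKENS[w], w) for w in words]

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def eat(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token}, saw {self.peek()}")
        self.pos += 1

    def sentence(self):            # <sentence> -> <noun phrase> <verb phrase> .
        self.noun_phrase()
        self.verb_phrase()
        self.eat("period")

    def noun_phrase(self):         # adjectives, optional article, then a noun
        while self.peek() == "adjective":
            self.eat("adjective")
        if self.peek() == "definite_article":
            self.eat("definite_article")
        self.eat("noun")

    def verb_phrase(self):         # <verb phrase> -> verb <direct object>
        self.eat("verb")
        self.noun_phrase()         # assume <direct object> -> <noun phrase>

def parses(text):
    """True if the string scans and parses; False otherwise."""
    try:
        p = Parser(scan(text))
        p.sentence()
        return p.pos == len(p.tokens)
    except (SyntaxError, KeyError):
        return False
```

With this sketch, `parses("Joan ate the ball.")` succeeds, while "The ate Joan ball." scans (every word looks up to a valid token) but fails to parse, exactly as in the discussion above.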
Parsing Programming Languages
The problem with English is that it is very complex and dynamic (it changes all of the time) and cannot be captured in grammar form, particularly in context-free grammar form. Programming languages, on the other hand, are designed specifically to be representable by context-free grammars. The usual forms for defining a programming language are:

- BNF --- Backus-Naur Form, named after John Backus and Peter Naur, who developed this formalism for defining one of the first carefully constructed programming languages: ALGOL.
- EBNF -- Extended BNF, extended with special notations to make the rules for a programming language easier to write and read by humans.
- CFG --- Context Free Grammars, developed by Noam Chomsky as a formalism for studying natural languages. It turns out that CFGs are equivalent to EBNF and that they are the form most useful for understanding parsing.
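To make the notational differences concrete, here is one hypothetical rule (a comma-separated identifier list, not drawn from any particular language definition) written in each style:

```
BNF:    <id-list> ::= <id> | <id> , <id-list>
EBNF:   id_list = id { "," id } ;
CFG:    L ⟶ id  |  id , L
```

The EBNF braces mean "zero or more repetitions," which is why a single EBNF rule can replace a pair of recursive BNF or CFG rules.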
As an example, here is a grammar for a small toy language:

G = (N, T, P, S), where
N = {<program>, <statements>, <statement>, <expression>, <id_list>, <id_list_rest>, <expression_list>, <expression_list_rest>}
T = {r_begin, r_end, r_read, r_write, period, eof, comma, id, assign_op, semicolon, add_op, mul_op, integer_literal, l_paren, r_paren}
P = {
<program> ⟶ r_begin <statements> r_end period
<statements> ⟶ <statement> <statements>
<statements> ⟶ ε
<statement> ⟶ id assign_op <expression> semicolon
<statement> ⟶ r_read l_paren <id_list> r_paren semicolon
<statement> ⟶ r_write l_paren <expression_list> r_paren semicolon
<expression> ⟶ <expression> add_op <expression>
<expression> ⟶ <expression> mul_op <expression>
<expression> ⟶ id
<expression> ⟶ integer_literal
<expression> ⟶ l_paren <expression> r_paren
<id_list> ⟶ id <id_list_rest>
<id_list_rest> ⟶ comma id <id_list_rest>
<id_list_rest> ⟶ ε
<expression_list> ⟶ <expression> <expression_list_rest>
<expression_list_rest> ⟶ comma <expression> <expression_list_rest>
<expression_list_rest> ⟶ ε
}
S = <program>
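A grammar like this maps naturally onto a recursive-descent parser: one procedure per nonterminal. The sketch below is one possible hand-written parser over an already-scanned token list. Note that the two <expression> rules above are left-recursive and ambiguous, so they are restructured here into the usual term/factor form (a restructuring assumed for this sketch, not stated in the grammar) to make recursive descent possible.

```python
# Recursive-descent sketch for the toy grammar above, operating on a
# pre-scanned list of token names such as "r_begin", "id", "assign_op".
# The ambiguous <expression> rules are rewritten as term/factor levels.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token}, saw {self.peek()}")
        self.pos += 1

    def program(self):            # <program> -> r_begin <statements> r_end period
        self.eat("r_begin")
        self.statements()
        self.eat("r_end")
        self.eat("period")
        self.eat("eof")

    def statements(self):         # <statements> -> <statement> <statements> | ε
        while self.peek() in ("id", "r_read", "r_write"):
            self.statement()

    def statement(self):
        if self.peek() == "id":   # id assign_op <expression> semicolon
            self.eat("id"); self.eat("assign_op")
            self.expression()
        elif self.peek() == "r_read":
            self.eat("r_read"); self.eat("l_paren")
            self.id_list()
            self.eat("r_paren")
        else:                     # r_write l_paren <expression_list> r_paren
            self.eat("r_write"); self.eat("l_paren")
            self.expression_list()
            self.eat("r_paren")
        self.eat("semicolon")

    def expression(self):         # term (add_op term)*
        self.term()
        while self.peek() == "add_op":
            self.eat("add_op"); self.term()

    def term(self):               # factor (mul_op factor)*
        self.factor()
        while self.peek() == "mul_op":
            self.eat("mul_op"); self.factor()

    def factor(self):             # id | integer_literal | l_paren <expression> r_paren
        if self.peek() == "l_paren":
            self.eat("l_paren"); self.expression(); self.eat("r_paren")
        elif self.peek() == "integer_literal":
            self.eat("integer_literal")
        else:
            self.eat("id")

    def id_list(self):            # id (comma id)*
        self.eat("id")
        while self.peek() == "comma":
            self.eat("comma"); self.eat("id")

    def expression_list(self):    # <expression> (comma <expression>)*
        self.expression()
        while self.peek() == "comma":
            self.eat("comma"); self.expression()

def parses(tokens):
    """True if the token stream forms a valid <program>."""
    try:
        Parser(tokens).program()
        return True
    except SyntaxError:
        return False
```

For example, the token stream for a program like `begin x := 1 + 2 * y; write(x); end.` parses, while the same stream with a semicolon removed does not. The while-loop forms of <statements>, <id_list_rest>, and <expression_list_rest> are just the iterative reading of their right-recursive (or ε) rules.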