Parser Error Handling and Recovery
Keeping Things Straight
As we have mentioned , there are three things we need to know at least a little bit about as we begin implementing the semantic analysis phase of a compiler. If we were not simultaneously learning about compiling and doing a project, we could perhaps study these things in succession. However, since we are writing a compiler at the same time as we are learning about compiling, we need to get at least an overview of the main issues to proceed.
The three things we have already mentioned are
- symbol table construction -- the structure used to keep track of identifiers and their attributes during compilation so that translation into IR can be done properly.
- run time memory management -- a methodology for determining how to access variable values in memory so that code can be generated during the translation phase will be able to include correct addressing for variables in memory.
- translation of source code to IR -- the creation of a new version of the source program in a form that is closer to target machine language.
- error handling at all phases of the compiler
The symbol table is needed to hold semantic information about identifiers encountered during the program, including their lexeme, type (e.g., integer, real, array), kind (e.g., variable, constant, etc.), and other attributes (e.g., parameter). Scoping is represented by, for example, having a different symbol table for each scope and maintaining these as a stack of symbol tables as parsing progresses, pushing the new symbol table when a scope is entered and popping the symbol table when the scope is exited. One thing that is needed for eventual translation is the memory location where each identifier represented in the symbol table will eventually be stored when the translated program is run on the computer.
Run time memory management gives us a model for memory that allows us to finish construction of the symbol table. In other words, this model lets us envision where each identifier will be eventually represented in memory. So, with this model in mind, we can include an entry in the symbol table for each identifier that indicates where that identifier will be in memory, so that we can generate the right addresses during translation. The most common memory management model uses a run time stack for managing variables. Each scope has its own activation record, which is pushed onto the stack when the scope is entered, and popped off when the scope is exited. Variables are located at fixed offsets from the start of the activation record for the scope.
In translating the source code to IR, the semantic analyzer uses the symbol table to do type checking (to see that identifiers have been declared and used properly), and does the translation using the run time memory address information stored in the symbol table to generate correct addresses for the identifiers.
Error handling is performed at all stages of the compiler: in the scanner, the parser, and the semantic analyzer.
Error Handling
Error handling refers to two things: detection of an error and recovering from an error. There are three points at which errors can be encountered and resolved:
- During Scanning
- During Parsing
- During Semantic Analysis
Once semantic analysis is complete, errors should not occur, because an entire translation of the original source program into equivalent IR form has been accomplished without error. There should be no possibility of encountering errors when translating IR into machine code. This does not, of course, mean that the translated program is correct. That is still the responsibility of the programmer.
Recovering from Scanner Detected Errors
The method we have applied for recovering from scanner errors is to raise an exception that is noted by the parser and then restart scanning with the next input character following the one that started the string in error when the scanner is called again. Another approach discussed was to restart the scanning with the character immediately following the token that the scanner got stuck on (i.e., skip to the next white space character). The first method seems better, because we may be able to scan more valid tokens. In any case, we
- encounter an error
- report the error
- recover from the error using one of the methods discussed (the one we chose for the project was to place the file pointer at the character just after the one dispatched on if a scanner error is found)
- set a flag to turn off IR generation
The last point is important. There is no reason to be generating a translation of a program that has an error in it.
Recovering from Parser Detected Errors
In general, there are three approaches:
Catastrophic Mode: In this case, when an error is detected, an exception is raised, a message printed about where the error occurred (e.g., line and column number) and what it might be (e.g., "expecting the start of a statement, but found a semicolon"), and the entire compilation process is terminated.
Panic Mode: In this case, when an error is detected, the parser raises an exception that causes an error message to be printed about where the error occurred and what it might be (as discussed under Catastrophic Mode), and also causes scanning to proceed to some delimiter, such as a semicolon, at which point parsing continues with the new construct. Just one delimiter can be chosen (e.g., the semicolon) as the synchronization token, or for more sophisticated approaches, a set of synchronizing tokens can be established based on the construct that is being parsed at the moment. For example, if an error is detected during the parse of an assignment statement, parsing would begin again with the statement following the next encountered semicolon if the semicolon is the synchronizing token. In this case, exceptions should be raised (thrown) and handled (caught) in the procedure where the synchronizing token is found so that parsing can be restarted at that point.
Non-Panic Mode: In this mode, the parser pretends not to panic quite so much and attempts to restart the parse at some point sooner than the next delimiter (after raising an exception and posting an appropriate error message).
Error Repair Mode: In this instance, the parser actually tries to repair the code so that the construct in which the error was found can continue to be parsed. Of course, the repair is likely not correct. The point is not to produce a correct program (although there were early attempts to do this) but to be able to continue the parsing process and be able to parse as much of the program as possible. So, for example, a common repair technique is to insert a semicolon wherever it might appear that one is missing.
Example of Panic Mode Error Recovery
Suppose you are parsing the statement
a := b * 4 d * (a + b);
In this case there appears to be an operator missing between c and 4. In this instance, our parse would look something like
Notice that we are trying to expand <factor_tail>, but the lookahead is an id. This, according to our LL(1) table, is impossible, so we have an error.
In catastrophic error recovery mode, our parser would simply print the line and column number where the error was encountered along with an appropriate message (e.g., "expecting an arithmetic operator or a semicolon, but found an identifier") and then quit.
In panic mode error recovery, we basically throw this statement away and try to parse the next statement instead. In other words, we back up to the most recent <statement> nonterminal in our tree, skip past all tokens until we see the semicolon, and then restart the parse from <statement> according to the rule
<statement_tail> ⟶ ; <statement> <statement_tail>
(Of course, this only applies to this grammar.)
In our parser, this same thing can be accomplished by rasing a Syntax_Error exception any time while parsing from procedure Statement. This exception is then propagated back up until procedure Statement is reached. At that point, the exception is handled by printing a proper error message, such as
Expected an operator or delimiter, but found an identifier at line 35 column 33. Skipping until after next semicolon.
Then the exception handler would repeatedly call the scanner until the next token after the next semicolon is reached. Finally, the exception handler would then make sure that procedure Statement begins afresh at this point with this new lookahead. So, if we were parsing the three statements
. . . read (a, b, d); a := b * 4 d * (a + b); write(a); . . .
the read statement would parse properly. The parser would then encounter an error in processing the assignment statement (as shown above). If the panic mode error recovery is used and the synchronizing token is the semicolon, the tokens d, *, (, a, +, b, and ) would be scanned and skipped. Parsing would then restart at the semicolon in the above rule for statement tail, and the call to <statement> would then result in a proper parse for the write statement.
Cascading Errors
All of the error recovery modes except the catastrophic mode can lead to a phenomenon called "error cascading." For example, in attempting to restart a parse at some point after an error is encountered may lead to yet more errors, because a proper restarting point was not chosen. Error repair is often not correct, either, yielding more errors as an incorrectly repaired construct is further parsed. As an example, consider an assignment statement with a missing operator:
Malt := Oranges + Grapes Sugar * Yeast;
In this case, catastrophic mode would terminate parsing after the scanner returned the ID Sugar. Panic mode with scanning restarting after the semicolon would work pretty well in this case. Non-panic mode might restart parsing with Sugar, assuming that Sugar is the start of a new statement, which would lead to cascading errors (because then it might try to restart with Yeast after the next error is encountered, and so on). Error repair might try to fix the construct by inserting a semicolon after Grapes and treating Sugar as the start of a new assignment statement, again leading to cascading errors.
Semantic Errors
Semantic errors are those encountered when a semantic action routine is running. The most common of these are errors related to the symbol table such as attempts to declare an identifier twice, or referring to an identifier that has not been declared. Probably the best way to handle such errors is to raise an exception that is propagated all the way out to the level in the parser where the semantic information of this semantic action (that now is in error) is no longer needed. This can also be understood by thinking of semantic records. Suppose that in processing an an arithmetic expression an undeclared variable is encountered. In this case, instead of passing back a semantic record with the lexeme corresponding to the variable, an error semantic record is returned. This error record means that IR cannot be produced for the operation that this variable was a part of, so instead of an intermediate value being returned as the semantics of the operation, an error record is further returned. If this expression is part of an assignment statement, then the error record continues to be passed back until the parser moves on to parsing the next statement (where no semantic record is needed anyway). Consider the example
X := A + B * C;
where C is an undeclared variable. In this case, the action #process_id(ID_Rec) would discover this fact and return an error in ID_Rec (that is, it would return the variant of the record that is an error record, or it would accomplish the same thing by raising an exception). The semantic action for gen_infix should at this point be called by the parser to do B*C. However, with one valid semantic record and one error record, gen_infix would just return an error record itself, representing the intermediate value for B*C. Similarly, gen_infix for A+T, where T is the semantic record for the temporary value (in this case, an error semantic record) would return an error record. Finally, gen_assign would have a semantic record for X and the error semantic record for the right hand side, and so it, too, would not generate code. In this case, no semantic record is returned from gen_assign, so the parser just continues, by parsing the next statement normally. Similar things can be accomplished with exceptions, in which the exception is passed back until the parser starts on the next statement.
Turning Off IR Generation
Regardless of where an error is encountered, in the scanner, the parser, or the semantic analyzer, it is best to turn off IR generation even though the parsing process continues (as well as static semantic checking, such as building the symbol table and checking whether variables are declared before use), because the generated IR would likely be incorrect. Other approaches include continuing to generate IR, but deleting the IR file before it can be run through an interpreter, or not allowing the code generation phase to proceed.