Error Detection in the Scanner
Scan Errors
There are four aspects to scanner errors:
- detection
- reporting
- recovery
- correction
Detection
Scanner errors are detected when the scanner cannot find a token during a call to the get_token method. If we consider the scanner to be one giant FSA that starts with the dispatcher, then a scanning error occurs when no accept state is encountered before scanning for the current token terminates (on an "others" branch of the dispatcher or of one of the FSAs). In other words, the string read during the current scan has no prefix that is a valid token. At this point, the scanner must set the token to error so that the parser knows a scanning error has occurred. The parser should then turn off the translation part of the compiler, because there is no reason to keep producing a translation once the source code is known to contain errors.
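As a minimal sketch of detection, the fragment below models a dispatcher with three small FSAs and returns an error token whenever an "others" branch is taken before an accept state is reached. The Token class, the token kinds, and the tiny token set (identifiers, integers, ":=") are assumptions made for illustration, not part of any particular compiler. Whitespace skipping is left out, and resuming one character past the error is only a placeholder recovery choice.

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str    # "identifier", "integer", "assign", "eof", or "error"
    lexeme: str
    row: int
    col: int


def get_token(src, pos=0, row=1, col=1):
    """Scan one token starting at src[pos]; return (token, next_pos)."""
    if pos >= len(src):
        return Token("eof", "", row, col), pos
    ch = src[pos]
    if ch.isalpha():    # dispatcher: identifier FSA
        end = pos + 1
        while end < len(src) and src[end].isalnum():
            end += 1
        return Token("identifier", src[pos:end], row, col), end
    if ch.isdigit():    # dispatcher: integer FSA
        end = pos + 1
        while end < len(src) and src[end].isdigit():
            end += 1
        return Token("integer", src[pos:end], row, col), end
    if ch == ":":       # dispatcher: FSA for ":="
        if pos + 1 < len(src) and src[pos + 1] == "=":
            return Token("assign", ":=", row, col), pos + 2
        # "others" branch inside the FSA: no accept state was reached
        return Token("error", ":", row, col), pos + 1
    # "others" branch on the dispatcher itself: no FSA starts with ch
    return Token("error", ch, row, col), pos + 1
```

Both error returns correspond to the two places the text names: an "others" branch on the dispatcher and an "others" branch inside an FSA.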
Recovery
Most compilers do not quit when the first error is encountered. Rather, the compiler "recovers" from the error in some fashion and tries to continue scanning and parsing (but not generating intermediate code) in order to give the programmer as much information as possible about mistakes in the program. For scanning, recovery means that the scanner must
- set the error token
- leave the file pointer in a location that allows the next call to get_token to function in a reasonable fashion
- set the row and column variables properly
What options are there for repositioning the file pointer?
- Skip all remaining characters from the end of the error string to the first white space character or EOF, whichever comes first. This approach assumes that the most likely place to find a valid token after an error is after the next sequence of white space characters.
- Suppose that the dispatcher is able to dispatch to a valid fsa. If an error then occurs it will be on an "others" branch in an fsa before an accept state is encountered. If this occurs, set the file pointer to the character that took the fsa down that "others" branch. That way, all of the characters leading to this one are skipped and scanning will resume with this character.
- Once the error has been detected on one of the "others" branches of either the dispatcher or an fsa dispatched to, set the file pointer to the character immediately following the dispatch character. Perhaps only that first character is in error and the second starts a valid token.
Of these options, the third is the one least likely to miss a valid token, because it skips only a single character.
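The three options differ only in where the file pointer ends up, which can be sketched as three small functions. The names, and the convention that start_pos is the index of the dispatch character and fail_pos is the index of the character that took the "others" branch, are assumptions for this illustration.

```python
def resume_after_error_string(src, fail_pos):
    """Option 1: skip ahead to the first whitespace character or EOF."""
    pos = fail_pos
    while pos < len(src) and not src[pos].isspace():
        pos += 1
    return pos


def resume_at_failing_char(start_pos, fail_pos):
    """Option 2: resume at the character that took the "others" branch."""
    # Assumption: the dispatcher succeeded, so fail_pos > start_pos.
    # The max() is an added guard so the scanner always makes progress.
    return max(fail_pos, start_pos + 1)


def resume_after_dispatch_char(start_pos):
    """Option 3: resume just past the dispatch character."""
    return start_pos + 1


# Hypothetical input: "<->" is a token, but "<" and "<-" alone are not,
# so the scan that starts at index 0 fails on "5" at index 2.
src = "<-5 y"
option1 = resume_after_error_string(src, 2)
option2 = resume_at_failing_char(0, 2)
option3 = resume_after_dispatch_char(0)
```

In this input, option 3 resumes at the "-", option 2 at the "5", and option 1 at the blank before "y", so each successive option rescans more of the characters that followed the error.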
Reporting
There are various places that scanner errors can be reported. One easy way to report most scanner errors is to leave the reporting to the parser, if it is acting as the driver for the scanner. In this case:
- set the token to a special error token, such as runon_comment
- be sure the lexeme contains the offending string that scanned improperly
- be sure the row and column numbers give the starting point of the offending lexeme in the source code
If these things are done in the scanner, then the parser, when it receives an error token, can print an appropriate error message before calling the scanner again to get the next token.
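A rough sketch of this arrangement follows, with the parser side reduced to a loop that only drives the scanner and reports error tokens. The Token tuple, the message table, and the function name drive are hypothetical. Only runon_comment comes from the text above, and the message format is one arbitrary choice.

```python
from collections import namedtuple

Token = namedtuple("Token", "kind lexeme row col")

# Hypothetical mapping from error-token kinds to message text.
ERROR_MESSAGES = {
    "error": "unrecognized text",
    "runon_comment": "comment is never closed",
}


def drive(get_token):
    """Call get_token until eof; collect a report for each error token."""
    reports = []
    translate = True                 # turned off once any error is seen
    while True:
        tok = get_token()
        if tok.kind == "eof":
            break
        if tok.kind in ERROR_MESSAGES:
            translate = False
            reports.append(
                f"{tok.row}:{tok.col}: {ERROR_MESSAGES[tok.kind]}: {tok.lexeme!r}"
            )
        # a real parser would consume the valid tokens here
    return reports, translate


# Stand-in scanner: replay a fixed token stream.
stream = iter([
    Token("identifier", "x", 1, 1),
    Token("error", "#", 1, 3),
    Token("runon_comment", "{ never closed", 2, 1),
    Token("eof", "", 3, 1),
])
reports, translate = drive(lambda: next(stream))
```

Because the scanner supplies the lexeme and its starting row and column, the loop needs no knowledge of how the error was detected.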
Correction
It is possible to try to correct the code during error recovery. (This is more of a possibility in the parser, where we know what kind of tokens we are looking for.) Error correction is not a problem if the intent is only to recover from the error and continue scanning and parsing in some reasonable fashion, in order to give the programmer as much information about errors as possible. Error correction is a problem if the intent is to generate a running translation of the program based on the compiler's attempts to correct the source. A program might then translate and execute, but the executable translation may not be what the programmer intended. Even if the translation does execute the way the programmer intended, it is still unsatisfactory that the source program has errors in it.
Attempts at error correction and translations based on the corrected source code have been made. PL/I had such a compiler at one point.