Milestone 3 - Completing the Scanner
Due at the beginning of your next lab period
Each team member is to help complete your team's scanner in your chosen implementation programming language. The scanner should be written following the outline covered in class. If your team would like to consider a variation, you must clear it first with the instructor.
Public Scanner Methods
The driver (main method and eventual parser) must have access to the following methods:
openFile
Remember that your program must run on esus regardless of the platform on which it is being developed. In this case, the main program or method should be the driver, which should start executing with something like the Linux command line
mp source-code-file.pas
where source-code-file is a command-line parameter that is the filename (actually the path name of that file relative to the directory in which the executable mp exists) of the text file the scanner will be scanning.
getToken
This method is to return the next token in the input file. Tokens can best be implemented as an enumerated type.
getLexeme
This method is to return the current lexeme, namely the string that was matched when the current token was scanned.
getLineNumber
This method is to return the number of the line on which the first character of the lexeme for the current token was scanned.
getColumnNumber
This method is to return the column number of the first character of the lexeme for the current token.
Driver Output
Design the driver to continuously retrieve tokens from the scanner and to print a token file. The token file is to be a standard text file with one line per scanned token containing the following information in this order, spaced nicely so that the output is easy to read, as given below.
token 1    line number 1    column number 1    lexeme 1
token 2    line number 2    column number 2    lexeme 2
. . .
token N    line number N    column number N    lexeme N
where token 1 is the first token scanned, line number 1 is the number of the line on which the token was scanned, column number 1 is the column on that line where the token begins, and lexeme 1 is the lexeme corresponding to the token, and so on for each line. (Notice that scanner errors are not handled at this time.) For example, the first two lines of the output file might read
mp_begin    1    3    begin
id          1    9    Number_Of_Students
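The driver loop described above can be sketched as follows. This is only an illustration in Python, not the required implementation: the StubScanner class, the snake_case method names, the token names in the enumeration, and the column widths are all assumptions made for the example. A real driver would call your own scanner, and the token names must come from the list on the class resource page. The sketch also assumes the end-of-file token stops the loop without being printed.

```python
from enum import Enum, auto


class Token(Enum):
    # Placeholder names (assumption); use the names from the class token list.
    MP_BEGIN = auto()
    MP_IDENTIFIER = auto()
    MP_EOF = auto()


class StubScanner:
    """Hypothetical stand-in that replays canned results instead of reading a file."""

    def __init__(self, results):
        self._results = list(results)
        self._current = None

    def get_token(self):
        self._current = self._results.pop(0)
        return self._current[0]

    def get_line_number(self):
        return self._current[1]

    def get_column_number(self):
        return self._current[2]

    def get_lexeme(self):
        return self._current[3]


def format_token_line(token, line, column, lexeme):
    # Fixed-width columns keep the token file easy to read.
    return f"{token.name:<16}{line:>5}{column:>5}  {lexeme}"


def token_lines(scanner):
    # Keep retrieving tokens until the end-of-file token arrives.
    while True:
        token = scanner.get_token()
        if token is Token.MP_EOF:
            break
        yield format_token_line(
            token,
            scanner.get_line_number(),
            scanner.get_column_number(),
            scanner.get_lexeme(),
        )
```

Each string produced by token_lines can then be written to the token file followed by a newline. Keeping the formatting in one small function makes it easy to adjust the column widths later.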
Scanner Design
You must implement the scanner according to the design given in the lecture unless you have cleared a variation with the instructor. The scanner should have a standard dispatcher that first skips white space and then examines (but does not consume) the first non-white space character to select the proper finite state automaton for scanning the token.
Implement each finite state automaton (augmented for practicality) as a separate method.
The precondition for each finite state automaton is to be:
- The file pointer points to the first character of the prospective token.
The postcondition for each finite state automaton is to be:
- The file pointer points to the first character following the lexeme scanned, and
- attribute token is set to the name of the scanned token, and
- attribute lexeme is set to the string that was matched for this token, and
- attribute lineNumber is set to the number of the source program line on which the first character of the lexeme was found, and
- attribute columnNumber is set to the number of the column of the line that holds the first character of the lexeme scanned.
Each finite state automaton is to use one of the structures given as options in class, and each team member is to implement their FSAs in the selected fashion so that all of the code has the same appearance.
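As a rough illustration of the dispatcher and the FSA pre- and postconditions, here is a minimal Python sketch. It is not the structure given in class. The token names other than MP_ERROR, the snake_case method names, and the simplified identifier and integer rules (ASCII letters, digits, and underscores) are assumptions made for the example, and the string position self.pos stands in for the file pointer.

```python
from enum import Enum, auto


class Token(Enum):
    # Placeholder names (assumption) apart from MP_ERROR.
    MP_IDENTIFIER = auto()
    MP_INTEGER_LIT = auto()
    MP_EOF = auto()
    MP_ERROR = auto()


def is_letter(ch):
    return ch.isascii() and ch.isalpha()


def is_digit(ch):
    return ch.isascii() and ch.isdigit()


class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0            # plays the role of the file pointer
        self.line, self.col = 1, 1
        self.token, self.lexeme = None, ""
        self.line_number, self.column_number = 0, 0

    def _peek(self):
        # Examine the next character without consuming it ("" at end of input).
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def _advance(self):
        ch = self.text[self.pos]
        self.pos += 1
        if ch == "\n":
            self.line, self.col = self.line + 1, 1
        else:
            self.col += 1
        return ch

    def get_token(self):
        # Dispatcher: skip white space, then peek to choose an FSA.
        while self._peek().isspace():
            self._advance()
        self.line_number, self.column_number = self.line, self.col
        ch = self._peek()
        if ch == "":
            self.token, self.lexeme = Token.MP_EOF, ""
        elif is_letter(ch):
            self._scan_identifier()
        elif is_digit(ch):
            self._scan_integer()
        else:
            # No FSA starts with this character: consume it and report it.
            self.lexeme = self._advance()
            self.token = Token.MP_ERROR
        return self.token

    def _scan_identifier(self):
        # Precondition: the pointer is on the first character of the token.
        start = self.pos
        while is_letter(self._peek()) or is_digit(self._peek()) or self._peek() == "_":
            self._advance()
        # Postcondition: the pointer is just past the lexeme.
        self.token, self.lexeme = Token.MP_IDENTIFIER, self.text[start:self.pos]

    def _scan_integer(self):
        start = self.pos
        while is_digit(self._peek()):
            self._advance()
        self.token, self.lexeme = Token.MP_INTEGER_LIT, self.text[start:self.pos]

    def get_lexeme(self):
        return self.lexeme

    def get_line_number(self):
        return self.line_number

    def get_column_number(self):
        return self.column_number
```

Recording the line and column in the dispatcher, after white space is skipped and before any FSA runs, is one way to make sure every FSA reports the position of the first character of its lexeme.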
Notes
In Pascal,
- lexeme length is unbounded
- use the list of tokens posted on the class resource page to complete the scanner
- use the standard names of the tokens given for the list of tokens
- remember that Pascal is case insensitive. When you scan an identifier, keep its original capitalization in the corresponding lexeme. You can store all reserved words in the table of reserved words in lower case; then, when looking up an identifier lexeme in that table, use your programming language's string toLower() method or equivalent to convert a copy of the scanned lexeme to lower case for the check.
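The reserved-word check in the last note might look like the following Python sketch. The table contents and the token names are assumptions made for illustration; the real entries come from the token list on the class resource page.

```python
# Hypothetical subset of the reserved-word table, keyed in lower case.
RESERVED_WORDS = {
    "begin": "MP_BEGIN",
    "end": "MP_END",
    "program": "MP_PROGRAM",
}


def classify_identifier(lexeme):
    # Lower-case only a copy for the lookup; the lexeme itself is unchanged.
    return RESERVED_WORDS.get(lexeme.lower(), "MP_IDENTIFIER")
```

For example, classify_identifier("BEGIN") and classify_identifier("Begin") both give the reserved-word token, while the scanner still reports the lexeme exactly as it was typed.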
By the time this assignment is done, your scanner should be completely functional, ready to work with your parser (to come).
Your scanner should also:
- Handle comments.
  - Comments begin with { and end with }.
  - Ignore comments while scanning.
  - Watch for runaway comments and set the token value to the error token MP_RUN_COMMENT if this occurs (if the end of file is encountered before the end of a comment is reached, a runaway comment error has occurred).
  - Be sure to update the line and column numbers properly while scanning comments.
- Handle strings.
  - Be sure that you compute the lexeme to be only the string between the opening and closing apostrophes (do not include these apostrophes).
  - Wherever two adjacent apostrophes are found in a string being scanned AFTER the first apostrophe, one apostrophe should be discarded in the lexeme (be sure to keep the column number correct by still counting any discarded apostrophe).
  - Watch for runaway strings (if you encounter the end of a line before the closing apostrophe is found, the string is a runaway string).
  - Set the token value to the new error token MP_RUN_STRING if a runaway string is found.
- Report scanning errors.
  - If the dispatcher cannot dispatch to an FSA because the first character for the current token scan does not start any valid token, the token value is to be set to the error token MP_ERROR.
  - If a valid token cannot be scanned (the FSA scanning that token does not pass through any accept state before an invalid character is found, as described in the lecture), the scanner is to set the token value to MP_ERROR. The lexeme should be the invalid character at the start of the scan, and the line and column numbers should indicate where that character was found. You should also print a meaningful error message in the driver based on the information in the lexeme, line number, and column number.
- Recover from scanning errors.
  - If a runaway comment is encountered, the driver should print an appropriate error message, noting where the comment started. The scanner should leave the file pointer pointing to the end of file character. Then, when the driver calls the scanner again, the scanner will return the end of file token, and the driver will terminate as it usually would upon receiving this token.
  - If a runaway string is encountered, the driver should indicate where the string started and give an appropriate error message. The file pointer should be left pointing at the end of line character that terminated the runaway string.
  - For other scanning errors, the scanner should leave the file pointer pointing to the first character after the character examined by the dispatcher.
- Your program must be able to create a printout as shown above.
  - In this case, error messages are to be listed in this same file on separate lines in between the lines for valid tokens. The error messages should indicate the line and column numbers of where the error was encountered.
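The comment and string rules above can be sketched in Python as follows. This is a hedged illustration only. The Cursor class, the function names, the tuple return values, and the token name MP_STRING_LIT are hypothetical. The sketch also assumes that the reported position of a string is that of its opening apostrophe, and that reaching end of file inside a string is treated like reaching end of line.

```python
class Cursor:
    """Hypothetical helper that tracks the file pointer, line, and column."""

    def __init__(self, text):
        self.text, self.pos, self.line, self.col = text, 0, 1, 1

    def peek(self):
        # Next character without consuming it ("" at end of file).
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def advance(self):
        ch = self.text[self.pos]
        self.pos += 1
        if ch == "\n":
            self.line, self.col = self.line + 1, 1
        else:
            self.col += 1
        return ch


def skip_comment(cur):
    """Precondition: cur.peek() == '{'. Returns None for a closed comment,
    or ("MP_RUN_COMMENT", line, col) giving where the runaway comment began."""
    start_line, start_col = cur.line, cur.col
    cur.advance()  # consume '{'
    while cur.peek() != "":
        # advance() keeps line and column correct across newlines.
        if cur.advance() == "}":
            return None
    # End of file reached: the pointer now rests on end of file.
    return ("MP_RUN_COMMENT", start_line, start_col)


def scan_string(cur):
    """Precondition: cur.peek() is an apostrophe.
    Returns (token_name, lexeme, line, col)."""
    line, col = cur.line, cur.col
    cur.advance()  # opening apostrophe is not part of the lexeme
    chars = []
    while True:
        ch = cur.peek()
        if ch == "" or ch == "\n":
            # Runaway string: leave the pointer on the end-of-line character.
            return ("MP_RUN_STRING", "".join(chars), line, col)
        cur.advance()
        if ch != "'":
            chars.append(ch)
        elif cur.peek() == "'":
            # Doubled apostrophe: keep one, but both were counted in the column.
            cur.advance()
            chars.append("'")
        else:
            return ("MP_STRING_LIT", "".join(chars), line, col)
```

Because every character passes through advance(), including the discarded apostrophe of a doubled pair and the newlines inside a comment, the line and column counters stay correct for the tokens that follow.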
Optional
Produce a source listing of the program you are scanning so that the program looks just as it was entered by the programmer. As errors occur, they should be noted by inserting an error line right below the source line with the appropriate error message and a mark (^) pointing to the start of the problem. You can extend your scanner to do this, but it is not a requirement.
Special requirements
- If you find any discrepancies or errors in the assignment, be sure to report them immediately, so that the page can be updated appropriately.
- You should be developing some good test cases for your scanner. You can post an announcement on the newsgroup that you are willing to trade test cases with others who have developed some.
To Turn In
- Turn in a PDF file of your scanner source code to the Milestone 3 dropbox by 3:00 pm on Friday, January 31.
- Ensure that your scanner is ready to do an initial demo by the start of lab on Friday, January 31.