Scanner 3

The Dispatcher

Remember that one important aid to the design and implementation of correct programs is the use of pre and post conditions. That is, at the beginning of every routine to be written, a statement of the pre conditions (the states of relevant variables at the time the routine is called) is given and a statement of the post conditions (the states relevant variables must be in after the routine has executed) is given. These two things should be enough information for any programmer to finish the design and implementation of the routine to accomplish the desired task.

In the previous lecture we gave a diagram of a partial scanner and described what it must do. The diagram is repeated here: The part to the left of the green dashes we refer to as the "dispatcher." The purpose of the dispatcher is to prepare for scanning of the next token. It does this by

skipping any white space in the input file beginning with the current position of the file pointer
examining the first non-white space character encountered without consuming that character (i.e., leaving the file pointer pointing to that first non-white space character)
calling the proper method for handling tokens that can begin with that character

In real compiler projects, the scanning could be done a bit more efficiently than this diagram indicates. For example, we could eliminate the states shown just to the right of the green dashed line, giving the following:

In this case, there is no dispatcher and separate fsa's for each token. This is just one gigantic fsa. Notice that we actually consume the first character of the lexeme for each token in this case. This, as stated, would be more efficient and is a technique you should remember. For your project, however, we want you to do it as shown in the first diagram. The reasons are simple. It is easier to assign a set of fsa's to each team member according to the first diagram. Each team member knows exactly what he or she must do in scanning for the specified token and can contribute the correct code without such tight coupling with the overall diagram..

In the above two diagrams we are assuming that all nondeterminisim in the dispatcher has been taken care of, as in the case in blue above, where the fsa's for Colon and Assign_Op have been combined into a single fsa (because they both start with a colon). Furthermore, we are assuming that all of the fsa's in red have been augmented to handle the practical aspects of parsing, just like the one in blue above. These design components would need to be done before the programming is started.

The scanner as an object

The scanner is best conceived as a class, object, or package, depending on your programming language of implementation. There must be methods in the scanner that allow the parser (driver) to get at necessary scanner values. These include

getNextToken -- a method that returns the next token
getLexeme -- a method that returns the current lexeme
getLlineNumber -- a method that returns the line number in the source code where the current lexeme is found
getColumnNumber -- a method that returns the column position of the start of the current lexeme
initializeScanner(file_name) -- a method that takes the source file name as a parameter and opens that file in the scanner

Other Formulations

There are other formulations that you might be inclined to consider, such as making tokens objects with attributes similar to those above. The organization of your scanner for your project in terms of what classes and objects to include will pretty much be left up to you. We will be making some aspects of project design required. Be sure to consider various ramifications of your choices. For example, if you make tokens objects, you might wind up keeping a token object around for every token, long after those tokens are no longer required, unless you dispose of them after you are done.

The dispatcher as part of method getNextToken

Let's now look at the dispatcher. It is part of method get_next_token.

getNextToken;
     -- pre:  The input file pointer is pointing at the first
     --       character after the last token scanned.

    -- post: The input file pointer is pointing at the first
    --                character after the current token.     
    --       Variable Token contains the name of the token
    --                scanned     
    --       Variable Lexeme contains the string that was     
    --                matched as the token was scanned.     
    --       Variable Line contains the line number in the source     
    --                code where the current token starts     
    --       Variable Column contains the column number in the     
    --                source code line where the current token starts

    skipWhiteSpace;     
    -- The input file pointer is pointing at the first character     
    --     of the current token.      
    -- Variable Line contains the line number in the source     
    --     code where the current token starts     
    -- Variable Column contains the column number in the     
    --     source code line where the current token starts

    Next_Character := Examine(Input_File);     
    -- Next_Character contains the first character of the     
    -- current input token, and the file pointer points to this     
    -- character (i.e., the file pointer has not moved)

  -- The dispatcher starts here
       case Next_Character is 
  -- dispatcher

      when '('                 ==> Scan_L_Paren

        -- scan for a left parenthesis

        -- PRE: The input file pointer is pointing at the first
        -- character of the current token.
        -- POST: The input file pointer is pointing at the first
        --     character after the current token.
        -- Variable Token contains l_paren (the name of the currently
        --     scanned token).         
        -- Variable Lexeme contains a left parenthesis (the string
        --     matching the currently scanned token).

      when ')'                 ==> Scan_R_Paren;

            -- scan for a right parenthesis

      when ';'                 ==> Scan_Semicolon;

            -- scan for a semicolon

      when ':'                 ==> Scan_Colon_or_Assign_Op;

            -- scan for colon or assignment operator

      when 'a'..'z' | 'A'..'Z' ==> Scan_id;

            -- scan for an identifier or a reserved word

      when '0'..'9'            ==> Scan_Numeric_Lit;

            -- scan for an integer, fixed-point, or float literal

      when <eof>               ==> Scan_Eof;

            -- scan for end of file character

      when others              ==> Scan_Error;

            -- handle a scanning error

   end case -- dispatcher;

 end get_token;

Notice the assertions throughout. These comments provide precise specifications for what each line of code they surround is supposed to accomplish. With good, complete assertions, team members can be given various components to complete and be able to finish those components to meet the specifications.

On the other hand, putting these comments in everywhere makes a program unreadable. So, a team needs to write them down separately, so that documentation exists about the specifications for team members to follow. In a real project, such specifications would be determined by management before the project started. We will require that you include pre and post conditions for each method you write. Other commenting requirements will be spelled out as we go along.

The Individual FSA's

Implementation of each fsa can be done in a standard way. Consider the fsa that scans for a colon or assignment operator. It is given in blue above. The code for a procedure implementing this fsa might look something like:

procedure Scan_Colon_Or_Assign_Op;
     -- Pre: The input file pointer is pointing at the first
     --      character of the current token (a colon, in this case).
     -- Post: The input file pointer is pointing at the first
     --                character after the current token.
     --       Variable Token contains the name of the token
     --               scanned
     --       Variable Lexeme contains the string that was
     --                matched as the token was scanned.

  Type State_Type = (S0, S1, S2);

  State : State_Type := S0; -- initialize State to the start state
  Done  : Boolean := False;

  begin
     lexeme := "";
     loop while not Done

      Case State is

        when S0 => -- this is the start state
           Get Next_Character from source file;
           case Next_Character is
             when ':' =>
               State := S1;
               lexeme := lexeme || Next_Character;
             when others =>
                null;
             -- there is nothing in the others clause, because we know
             -- from the dispatcher that the first character read in
             -- state S0 must be a colon.
             -- Language note:  carefully designed languages, like Ada
             -- require an "others" or "default" clause in every case
             -- statement to force the programmer to consider every
             -- possible case.
           end case

        when S1 => -- a colon has been scanned
             Get Next_Character from source file;
           case Next_Character is
             when '='    =>
               state := S2;
               lexeme := lexeme || Next_Character;
               -- the assign_op token has been scanned; scanning stops
             when others =>
               -- the colon token and one extra character have
               -- been scanned; actions are taken to set the
               -- attributes properly and halt the scan
               Token := Colon;
               Done := True;
               reset the file pointer back one
           end case;

        when S2 ==> -- an assignment operator has been scanned
            -- there are no transitions from this state; an assignment
            -- operator (:=) has been scanned; actions are taken
            -- to set attributes properly and to halt the scan
               lexeme := lexeme || Next_Character;
               Token := Assign_Op;
               Done := True;

        end case;
       end loop
     end Scan_Colon_Or_Assign_Op;

Note that in this case the variables Lexeme, Token, Line, and Column are all global variables to the scanner module. In Ada and other non-object oriented programming languages,, this use of global variables is proper, because they are part of the state of the scanner that is continuously accessed by the driver. Judicious use of global variables is proper. You must be careful with them, however. In the case of C++ and Java, these are not really called global variables, but rather attributes and are part and parcel of the concept of object oriented programming.

In general, this form for implementing a finite state automaton has the following features:

The set of states for the fsa is declared
The current state is set to the start state of the fsa
A loop terminating variable, "done," is set to false
A loop is constructed that terminates when variable "done" is true
The first statement in the loop is a case, or switch, statement that sends execution to the code associated with the current state.
In the code for each state the next character from the input is read, and another case, or switch, statement is implemented that sets the next state properly based on the current state and the current input character. The current lexeme being scanned is also updated with the current character read as appropriate.
Others clauses take care of the special features added to the fsa's to account for the practical considerations of scanning, such as moving the file pointer back if too many characters have been read, updating the lexeme, and setting the token value properly.