Symbol Tables—A Look Ahead
Where We Are
We are now at the point in the class represented by this partial diagram of a compiler:
Source File ----> Scanner ----> Token File ----> Parser ----> Parse Tree
As we all know, we haven't actually been generating a parse tree. However, the parse tree is easily recovered from your recursive descent parser. If at every point in your parser where you are expanding a rule you print that rule number to a file, you have a record of the rules you used. If you then read that file back in order, you can easily construct the parse tree by starting with the start symbol and applying each rule in order to the leftmost nonterminal in the tree. That is, the file of rules you printed is a leftmost parse of the string you have just parsed. For example, suppose that the following three rules were the only ones in our grammar with <expression_tail> on the left-hand side:
3. <expression_tail> --> + <term> <expression_tail>
4. <expression_tail> --> - <term> <expression_tail>
5. <expression_tail> --> ε
Then the procedure in a recursive descent parser for <expression_tail> would have the following form. Note the new Put statements that have been added to print out the rule being applied.
procedure Expression_Tail is
begin
   case Look_Ahead is
      when "+" =>
         -- 3. <expression_tail> --> + <term> <expression_tail>
         Put(Parse_File, 3);
         Match("+");
         Term;
         Expression_Tail;
      when "-" =>
         -- 4. <expression_tail> --> - <term> <expression_tail>
         Put(Parse_File, 4);
         Match("-");
         Term;
         Expression_Tail;
      when ")" | eof =>
         -- 5. <expression_tail> --> ε
         Put(Parse_File, 5);
         null;
      when others =>
         Syntax_Error;
   end case;
end Expression_Tail;
If the parse file were printed out at the end of the parse, the numbers would represent a leftmost parse of the input string. So we really do have access to the actual parse tree if we desire it.
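For instance, suppose (purely for illustration; these rule numbers are not from our actual grammar) that the grammar also contained the rules 1. <expression> --> <term> <expression_tail> and 2. <term> --> id. Parsing the string x + y would then write 1 2 3 2 5 to the parse file, and replaying those rules against the leftmost nonterminal at each step reproduces the derivation, and hence the parse tree:

<expression>
   => <term> <expression_tail>            (rule 1)
   => id <expression_tail>                (rule 2, id is x)
   => id + <term> <expression_tail>       (rule 3)
   => id + id <expression_tail>           (rule 2, id is y)
   => id + id                             (rule 5)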
Where We Are Going Next
So, now that we have completed parsing, where do we head next? We must begin the process of capturing the meaning of the programs we parse so that we can convert them into a different form, which we usually call Intermediate Representation, or IR. There are two distinct pieces in this process:
- Symbol table creation
- IR code generation
Our diagram now looks like:
Source File ----> Scanner ----> Token File ----> Parser ----> Parse Tree ----> Semantic Analyzer ----> IR File
                                                    \                              /
                                                     \                            /
                                                              Symbol Table
Symbol Tables
Why We Need One
When we parse a string like
Fred := Mary - 5;
in order to determine whether it parses ok, we need to know whether Fred and Mary have been declared and what their types are.
The point at which these variables have been declared (if they have been) is in a VAR statement, which is arbitrarily far away from the current line being parsed. That is, this assignment statement should only be parsable in the context that Fred and Mary have been declared to be of type Integer. Keeping this information around in grammar rules would require a context sensitive grammar rather than a context free grammar, which we use for defining programming languages.
So, why don't we use a context sensitive grammar rather than a context free grammar to define programming languages? They are too unwieldy and unreadable. Furthermore, most of a programming language can be represented just fine by a context free grammar. Where context information is needed, we use a different programming technique: the symbol table. The symbol table contains all names (e.g., variable names, function names, and procedure names) and their attributes.
The Symbol Table ADT
We will be seeing in more detail how to deal with semantic records and actions in general, but for now, we are going to focus on the symbol table, because we will need to deal with it first. Let's go through an exercise of determining how to build a symbol table abstract data type. Every ADT has a set of operations and a data structure.
Operations
The operations we are likely to need are listed below; a sketch of a package specification providing them follows the list.
- insert
- lookup
- return attributes
- create symbol table
- destroy symbol table
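Here is one possible shape for such a specification in Ada. The names (Symbol_Tables, Attributes, Is_Present, Get_Attributes) and the array-based representation in the private part are placeholders for whatever design you choose, not a required design:

with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

package Symbol_Tables is

   -- Attributes kept for each name; just its type for now, but more
   -- fields (kind, size, offset, ...) can be added later.
   type Attributes is record
      Type_Name : Unbounded_String;
   end record;

   type Symbol_Table is limited private;

   procedure Create(Table : in out Symbol_Table);    -- create symbol table
   procedure Destroy(Table : in out Symbol_Table);   -- destroy symbol table

   procedure Insert(Table  : in out Symbol_Table;    -- insert
                    Lexeme : in String;
                    Attrs  : in Attributes);

   function Is_Present(Table  : Symbol_Table;        -- lookup
                       Lexeme : String) return Boolean;

   function Get_Attributes(Table  : Symbol_Table;    -- return attributes
                           Lexeme : String) return Attributes;

private

   -- One simple representation: a sequential list (array) of entries.
   type Table_Entry is record
      Lexeme : Unbounded_String;
      Attrs  : Attributes;
   end record;

   Max_Entries : constant := 500;

   type Entry_Array is array (1 .. Max_Entries) of Table_Entry;

   type Symbol_Table is limited record
      Entries : Entry_Array;
      Count   : Natural := 0;
   end record;

end Symbol_Tables;

The body would implement Insert by appending to the array and the lookups by a linear search over Entries(1 .. Count); none of that is visible to the parser, which sees only the operations above.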
Data Structure
We will be needing a separate symbol table for each scope (we'll talk about this later). Each symbol table would have the same underlying structure, which could be any of the following:
- sequential list (array of lexemes and attributes)
- sequential list (linked list of lexemes and attributes)
- hash table
- binary search tree
Straight binary search on an ordered list of identifiers would not work well because the list would have to be kept sorted even though the identifiers arrive in an arbitrary order, which makes insertions expensive. Each of the above data structures implies a different time complexity for the operations. Note again, though, that if the ADT is carefully specified, the internal data structure can be changed to a more efficient one without changing the external operations that are used.
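To illustrate that last point, only the private part (and the corresponding body) of the sketch above would have to change to switch from the array to a hash table. For example, using the standard library's hashed maps (Ada.Containers.Indefinite_Hashed_Maps), the private part might become the following, while the visible operations, and therefore the parser, stay exactly as they were:

with Ada.Containers.Indefinite_Hashed_Maps;
with Ada.Strings.Hash;
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

package Symbol_Tables is

   -- ... visible part exactly as in the earlier sketch ...

private

   -- Map each lexeme directly to its attributes: expected O(1) insert
   -- and lookup instead of the array's linear search.
   package Name_Maps is new Ada.Containers.Indefinite_Hashed_Maps
     (Key_Type        => String,
      Element_Type    => Attributes,
      Hash            => Ada.Strings.Hash,
      Equivalent_Keys => "=");

   type Symbol_Table is limited record
      Entries : Name_Maps.Map;
   end record;

end Symbol_Tables;

In the body, Insert would become Table.Entries.Insert(Lexeme, Attrs), Is_Present would become Table.Entries.Contains(Lexeme), and Get_Attributes would become Table.Entries.Element(Lexeme).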
Inserting Items Into the Symbol Table
Semantic Actions
The general approach to the entire enterprise of generating semantic information (whether inserting names into a symbol table or generating code) is as follows. Here we use the term semantic action to describe a call to some procedure or method that performs the desired action, such as inserting a name into the symbol table.
The approach to including semantic actions is the following:
- Examine the right-hand side of each rule in the grammar.
- At each point in the right-hand side where, having parsed that far, we know enough to carry out an action, insert an action symbol into the rule at that point.
For example, suppose that we had the following rule in our grammar:
<variable declaration> --> id : <type> ;
corresponding to a statement such as
Fred : Integer;
in the programming language. At the point where we had matched the id and returned from processing <type> we would know enough information to insert the id and its type into the symbol table. We could do that before processing the semicolon. So we could modify that rule to show where we know enough to carry out this action:
<variable declaration> --> id : <type> #insert_id ;
Here, #insert_id is marked as an action with a # to distinguish it from nonterminals and tokens. It just means that an action must be taken at this point to insert the matched id and its type into the symbol table. In a recursive descent parser, #insert_id would be a call to a new procedure or a method in a symbol table class that does the job of inserting Fred into the symbol table with type Integer, according to the above example. The new procedure or method should not be part of the parser, but rather part of the symbol table package or class.
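In a recursive descent parser, the rule might turn into a procedure along the following lines. This is only a sketch: Id, Type_Part, and Insert_Id are placeholder names for the identifier token, the procedure for the nonterminal <type> (Type itself is a reserved word in Ada), and the symbol table operation, however your own parser names them.

procedure Variable_Declaration is
begin
   -- <variable declaration> --> id : <type> #insert_id ;
   Match(Id);       -- matches the identifier (Fred)
   Match(":");
   Type_Part;       -- parses <type> (Integer)
   Insert_Id;       -- #insert_id: enter the id and its type into the symbol
                    -- table; provided by the symbol table package, not the parser
   Match(";");
end Variable_Declaration;

How Insert_Id learns which id and which type it is dealing with is the subject of the next section on semantic records.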
Semantic Records
In order for semantic actions to do their jobs, they will need to know what they are working with. In the above example, #insert_id would need to know that the id matched referred to lexeme Fred and that <type> was resolved to Integer. To keep this information around, we:
- assign a semantic record (object) to each token and nonterminal in the grammar.
The semantic records are used to hold the semantic information (meaning) of each symbol. For example, in the above instance, the semantic record for id in this parse would contain Fred, and the semantic record for <type> would contain Integer. The semantic record for id would be filled in during the match of the id, and the semantic record for <type> would be passed to the procedure for <type>, which would fill it with Integer (in this case) and return it. The procedure Insert_Id would have two parameters:
procedure Insert_Id(Id_Record, Type_Record : in Semantic_Record);
Procedure Insert_Id would insert the lexeme in Id_Record and the type information in Type_Record into the symbol table. This means that your recursive descent parser will need to be upgraded to have semantic records and semantic action procedures.
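A minimal sketch of what that upgrade could look like follows. Semantic_Record, Type_Part, and the two-argument form of Match are placeholder names used only for illustration; the point is simply that the records filled in while matching id and parsing <type> are what get handed to Insert_Id.

with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

package Semantic_Records is
   -- A semantic record carries the meaning gathered so far for one symbol;
   -- here, just a name (a lexeme such as Fred, or a type such as Integer).
   type Semantic_Record is record
      Name : Unbounded_String;
   end record;
end Semantic_Records;

-- In the symbol table package:
procedure Insert_Id(Id_Record, Type_Record : in Semantic_Record);

-- In the parser, the declaration procedure now passes the records around:
procedure Variable_Declaration is
   Id_Rec   : Semantic_Record;
   Type_Rec : Semantic_Record;
begin
   Match(Id, Id_Rec);            -- fills Id_Rec.Name with the lexeme (Fred)
   Match(":");
   Type_Part(Type_Rec);          -- fills Type_Rec.Name with the type (Integer)
   Insert_Id(Id_Rec, Type_Rec);  -- #insert_id: hand both records to the symbol table
   Match(";");
end Variable_Declaration;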