An Overview of the Translation Process

The Objective of a Programming Language Translator

Given a program in a source programming language, the objective of a translator (compiler) for that programming language is to translate the source program into some target language.

       Source Program                                 Target Language
  --------------------->  Translator (Compiler)  ----------------------->
            File                                          File

More generally, the input file is just an ASCII (or Unicode) text file. The translator must determine whether the text file has the form of a source program in the programming language at hand, and, if so, translate it properly into an equivalent program in the target language.

Notice that for this scheme there must be a different translator for each source programming language and target machine pair. That is, if a company wants to write translators for Ada, Java, C#, and C++ for the Intel 80X86 computer line (X > 2), for PowerPC-based computers, for DEC (now Compaq, now HP) Alpha computers, and for Sun SPARCstations, it must write 16 different translators (4 languages times 4 machines), one for each source language and machine pair:

  Ada Program                        Intel Machine
 ----------------->  translator  ------------------->
     File                              Code File

  C++ Program                        Intel Machine
 ----------------->  translator  ------------------->
     File                              Code File

  Java Program                       Intel Machine
 ----------------->  translator  ------------------->
     File                              Code File

. . .

  Ada Program                         Sun Machine
 ----------------->  translator  ------------------->
     File                              Code File

  C++ Program                         Sun Machine
 ----------------->  translator  ------------------->
     File                              Code File

  Java Program                        Sun Machine
 ----------------->  translator  ------------------->
     File                              Code File

The problem is even a little more complicated than this. Each operating system (e.g., Linux running on a PC, or Windows XP running on the same PC) requires different things of the compiler, so a compiler that translates to Intel machine code, say, must also be written to cooperate with a particular operating system. So, separate compilers would need to be developed to translate C++ for an Intel Pentium processor under Windows XP and under Linux, even though the machine language is the same (the Pentium machine language).

Think back to your operating systems and machine architecture courses to see why this is so. Many of the things that a program would be written to do, such as writing to a disk file or reading from a CD-ROM, are things that user programs are not allowed to do directly. Instructions for initiating such activities are privileged instructions that can only be executed when the processor is running in privileged, or supervisor, mode. That is, to perform such operations, the user program must call a procedure in the operating system to do the job for it. This fact and other similar facts require that compilers be written to produce target machine programs that work with the target operating system as well as produce machine language for the processor in question. Thus, the above picture would more accurately look like the following:

  Ada Program                    Intel Machine/Windows
 ----------------->  Compiler  ------------------->
     File                            Code File

  Ada Program                    Intel Machine/Linux
 ----------------->  Compiler  ------------------->
     File                            Code File

  C++ Program                    Intel Machine/Windows
 ----------------->  Compiler  ------------------->
     File                            Code File

  C++ Program                    Intel Machine/Linux
 ----------------->  Compiler  ------------------->
     File                            Code File

  Java Program                   Intel Machine/Windows
 ----------------->  Compiler  ------------------->
     File                            Code File

  Java Program                   Intel Machine/Linux
 ----------------->  Compiler  ------------------->
     File                            Code File

. . .

  Ada Program                    Sun Machine/Sun OS
 ----------------->  Compiler  ------------------->
     File                            Code File

  Ada Program                    Sun Machine/Linux
 ----------------->  Compiler  ------------------->
     File                            Code File

  C++ Program                    Sun Machine/Sun OS
 ----------------->  Compiler  ------------------->
     File                            Code File

  C++ Program                    Sun Machine/Linux
 ----------------->  Compiler  ------------------->
     File                            Code File

  Java Program                   Sun Machine/Sun OS
 ----------------->  Compiler  ------------------->
     File                            Code File

  Java Program                   Sun Machine/Linux
 ----------------->  Compiler  ------------------->
     File                            Code File

One can see how much more complex this becomes when even more operating systems and computer types are involved. Many compiler companies focus on just a few machine architectures and operating systems. Of course, in the PC market, most focus on the Intel/Windows platform.

Reducing Complexity

One way to reduce the complexity of the situation, a tack followed by all commercial vendors, is to recognize that some parts of the compiler are independent of the actual target machine and operating system. For example, one big part of any C++ compiler is the code that turns a string of ASCII characters into meaningful units (called tokens in compiler parlance).

Scanning. The phase of a compiler that looks at a string of characters and groups them into meaningful units is called the scanner. Suppose, for example, that the following characters appear in a file as ASCII code: an x, a 1, a blank, an =, a blank, a 3, a 4, a semicolon, and an end of line character. The line represented by these characters is

x1 = 34;

The scanner must be able to determine that the characters x and 1 belong together and that they form a variable named x1. Similarly, the scanner must be able to recognize the = sign as an independent token standing for the assignment operator. Again, the scanner must be able to group the 3 and the 4 together as being a constant integer. It must also recognize the semicolon as a single entity. Finally, the scanner must ignore the blanks, end of line characters, tabs, and other characters that do not add to the content of the program it is scanning.
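The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token category names (IDENTIFIER, INTEGER, ASSIGN, SEMICOLON, SKIP) and the function name scan are made up for this sketch, and a real C++ scanner must handle many more kinds of tokens.

```python
import re

# Hypothetical token categories for this sketch only.
TOKEN_SPEC = [
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),  # e.g. x1
    ("INTEGER",    r"[0-9]+"),                  # e.g. 34
    ("ASSIGN",     r"="),
    ("SEMICOLON",  r";"),
    ("SKIP",       r"\s+"),                     # blanks, tabs, end of line
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(text):
    """Group the characters of text into (category, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":      # whitespace adds nothing to the program
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token in scan("x1 = 34;\n"):
        print(token)
```

Note that the scanner only groups characters; it makes no judgment about whether the resulting tokens form a legal statement.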

Syntax Analysis. The phase of the compiler that determines whether the tokens found by the scanner form a correct program is called the syntax analysis phase. For example, the input line

x = y++

would be flagged as syntactically incorrect if a semicolon is required at the end of the line. Similarly, the line

x = x * x -/ 5;

would be flagged as syntactically incorrect, because it is not formed according to the rules of C++.

Syntax analysis of a source program can be done without considering the target machine or operating system.
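As a rough illustration of syntax checking, the following Python sketch accepts or rejects a single assignment statement. It assumes a tiny made-up grammar (an identifier, an = sign, an arithmetic expression with an optional postfix ++, and a required semicolon), which covers only a sliver of real C++. The names tokenize, Parser, and is_valid are hypothetical.

```python
import re

# One token per match: identifier, integer, or operator/punctuation.
TOKEN_RE = re.compile(r"\s*(?:(?P<ID>[A-Za-z_]\w*)|(?P<INT>\d+)|(?P<OP>\+\+|[-+*/=;()]))")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


class Parser:
    """Recursive-descent checker for the assumed grammar:
         statement  := ID '=' expression ';'
         expression := term (('+' | '-') term)*
         term       := factor (('*' | '/') factor)*
         factor     := (ID | INT | '(' expression ')') ['++']
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def expect(self, kind, value=None):
        token = self.peek()
        if token[0] != kind or (value is not None and token[1] != value):
            wanted = value if value is not None else kind
            raise SyntaxError(f"expected {wanted!r}, found {token[1] or 'end of input'!r}")
        self.pos += 1

    def statement(self):
        self.expect("ID")
        self.expect("OP", "=")
        self.expression()
        self.expect("OP", ";")
        self.expect("EOF")

    def expression(self):
        self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            self.pos += 1
            self.term()

    def term(self):
        self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            self.pos += 1
            self.factor()

    def factor(self):
        kind, value = self.peek()
        if kind in ("ID", "INT"):
            self.pos += 1
        elif (kind, value) == ("OP", "("):
            self.pos += 1
            self.expression()
            self.expect("OP", ")")
        else:
            raise SyntaxError(f"unexpected {value or 'end of input'!r}")
        if self.peek() == ("OP", "++"):    # optional postfix increment
            self.pos += 1


def is_valid(line):
    try:
        Parser(tokenize(line)).statement()
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    for line in ["x = y++", "x = y++;", "x = x * x -/ 5;", "x = x * x / 5;"]:
        print(f"{line!r}: {'ok' if is_valid(line) else 'syntax error'}")
```

Nothing in this check refers to a target machine or operating system, which is why the same syntax analyzer can serve every platform.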

Semantic Analysis. The portion of a compiler that converts the original meaning of a source program into the same meaning in a different form (e.g., the machine language of the target computer) is called the semantic analysis phase. Interestingly, semantic analysis can be done in a machine-independent fashion, too, and indeed this is the usual practice. A line such as

x = x + 3;

is first checked to ensure that it is syntactically correct, and then it is translated into an intermediate form that is closer to what a real machine language might be, such as

load  r1, x    -- load the value x into register 1
add   r1, '3'  -- add the constant 3 to the value in register 1
store r1, x    -- store the value in register 1 into x

Notice that the meaning of the line x = x + 3; has been captured in a different form. This is semantic analysis. This different form is not the machine language of any real computer, but it is closer to most real machine languages than was the original source program, so translating from this intermediate form into a real machine language is quite straightforward.
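To make the idea concrete, here is a minimal Python sketch that turns statements of the form name = a op b; into the load/operate/store intermediate form shown above. It is an assumption-laden toy: it recognizes only that one statement shape, it always uses register r1, and the mnemonics sub, mul, and div are invented here by analogy with add. The function names operand and to_intermediate are hypothetical.

```python
import re

# Matches only: name = operand op operand ;   (operands are names or integers)
STATEMENT_RE = re.compile(
    r"\s*([A-Za-z_]\w*)\s*=\s*([A-Za-z_]\w*|\d+)\s*([-+*/])\s*([A-Za-z_]\w*|\d+)\s*;\s*$"
)
# 'add' comes from the example above; the other mnemonics are assumed.
MNEMONIC = {"+": "add", "-": "sub", "*": "mul", "/": "div"}


def operand(text):
    """Write constants in quotes, as in the example above; leave names bare."""
    return f"'{text}'" if text.isdigit() else text


def to_intermediate(statement):
    match = STATEMENT_RE.match(statement)
    if match is None:
        raise ValueError("this sketch only handles statements like 'x = x + 3;'")
    target, left, op, right = match.groups()
    return [
        f"load  r1, {operand(left)}",
        f"{MNEMONIC[op]:<5} r1, {operand(right)}",
        f"store r1, {target}",
    ]


if __name__ == "__main__":
    print("\n".join(to_intermediate("x = x + 3;")))
```

The output names no real processor, so the same translation step can feed any back end.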

If we call the syntax analysis and semantic analysis steps the front end of a compiler, we now have the picture:

                                                                                        Intel Machine/Windows
                                                              |--> Compiler Back End 1 ----------------------->
  Ada Program                           Intermediate Form     |                         target machine code
 ----------------->  Compiler Front End  -------------------> |
     File                                Code File            |                         Intel Machine/Linux
                                                              |--> Compiler Back End 2 ----------------------->
                                                              |                         target machine code
                                                              |
                                                              |                         Mac Machine/Linux
                                                              |--> Compiler Back End 3 ----------------------->
                                                              |                         target machine code
                                                              |
                                                              |                         Mac Machine/OSX
                                                              |--> Compiler Back End 4 ----------------------->
                                                              |                         target machine code
                                                              |
                                                              |                         Alpha Machine/Linux
                                                              |--> Compiler Back End 5 ----------------------->
                                                              |                         target machine code
                                                              |
                                                              |                         Alpha Machine/Other OS
                                                              |--> Compiler Back End 6 ----------------------->
                                                                                        target machine code

In other words, much of the work of writing compilers for different platforms can be saved, in that one front end that does much of the work of syntax and semantic analysis can be written, and then a different back end for each newly targeted computer/OS platform can be written separately. If the intermediate form is chosen well, it can work for different languages, saving even more, as in the following diagram.

                                                                                          Intel Machine/Windows
                                                                |--> Compiler Back End 1 ----------------------->
  Ada Program                             Intermediate Form     |                         target machine code
 ----------------->  Compiler Front End 1 ------------------->  |
     File                                  Code File            |                         Intel Machine/Linux
                                            ^                   |--> Compiler Back End 2 ----------------------->
                                            |                   |                         target machine code
  C++ Program                               |                   |
 ------------------- Compiler Front End 2 ---                   |                         Mac Machine/Linux
     File                                                       |--> Compiler Back End 3 ----------------------->
                                                                |                         target machine code
                                                                |
                                                                |                         Mac Machine/OSX
                                                                |--> Compiler Back End 4 ----------------------->
                                                                |                         target machine code
                                                                |
                                                                |                         Alpha Machine/Linux
                                                                |--> Compiler Back End 5 ----------------------->
                                                                |                         target machine code
                                                                |
                                                                |                         Alpha Machine/Other OS
                                                                |--> Compiler Back End 6 ----------------------->
                                                                                          target machine code

And so forth, for each different programming language.
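The savings can be checked with a little arithmetic. The Python sketch below counts the pieces that must be written with and without a shared intermediate form, using the two languages and six platforms from the diagram above. The function names are hypothetical, and the count treats every front end and back end as one unit of work, which is a simplification.

```python
from itertools import product

LANGUAGES = ["Ada", "C++"]
PLATFORMS = [
    "Intel/Windows", "Intel/Linux",
    "Mac/Linux", "Mac/OSX",
    "Alpha/Linux", "Alpha/Other OS",
]


def monolithic_compilers(languages, platforms):
    """One complete compiler for every (language, platform) pair."""
    return list(product(languages, platforms))


def shared_form_components(languages, platforms):
    """One front end per language plus one back end per platform."""
    front_ends = [f"{language} front end" for language in languages]
    back_ends = [f"{platform} back end" for platform in platforms]
    return front_ends + back_ends


if __name__ == "__main__":
    print(len(monolithic_compilers(LANGUAGES, PLATFORMS)), "complete compilers")
    print(len(shared_form_components(LANGUAGES, PLATFORMS)), "front ends and back ends")
```

Each additional language then costs one new front end rather than six new compilers, so the gap widens as languages and platforms are added.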

Compiling Java is Different

We will talk more about the Java concept as time goes on. For now, just notice that Java program translation normally stops at the intermediate form. That is, in the usual model, Java is not translated into the actual machine language of the target machine. Instead, it is translated into "byte codes," the language of a virtual machine. Thus, browsers that run Java programs must also have a Java Virtual Machine (JVM) plugin that interprets this virtual machine code. Notice that byte codes are not Java-specific. That is, other programming languages can be translated into byte codes and the JVM will run those programs just fine. Indeed, compilers that target byte codes have been written for a number of programming languages other than Java, including Ada and Python.
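The idea of interpreting virtual machine code can be sketched with a toy stack machine in Python. The JVM is a stack-based machine, but the four instruction names used here (load, push_const, add, store) are invented for this sketch and are not real JVM byte codes. The function name run is also hypothetical.

```python
def run(program, variables):
    """Interpret a list of (operation, argument) pairs on a toy stack machine."""
    stack = []
    for operation, argument in program:
        if operation == "load":            # push the value of a variable
            stack.append(variables[argument])
        elif operation == "push_const":    # push a constant
            stack.append(argument)
        elif operation == "add":           # pop two values, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif operation == "store":         # pop a value into a variable
            variables[argument] = stack.pop()
        else:
            raise ValueError(f"unknown operation {operation!r}")
    return variables


# A hand-written translation of: x = x + 3;
PROGRAM = [
    ("load", "x"),
    ("push_const", 3),
    ("add", None),
    ("store", "x"),
]

if __name__ == "__main__":
    print(run(PROGRAM, {"x": 4}))
```

The same PROGRAM list runs unchanged wherever the interpreter runs, which is the portability argument for stopping translation at the intermediate form.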

Just-in-time compiling refers to the process of translating the byte codes, once they have been retrieved from some server, into native machine code on the user's computer "just in time" to run that code. More about this later.

Tackling a Compiler

Compilers have been studied for a long time. There is a large body of work, both theoretical and practical, that we can apply as we study compilers. Thus, rather than trying to write a compiler from scratch as a term programming project, we will first study this work and let it guide how we approach the problem. We will study the compilation process in an organized fashion that lets you write various pieces as we progress.