
Compilers: The Front End

Over time, this column has discussed many programming topics and techniques. However, one subject we have never fully addressed is what actually happens at “compile time.” How does a compiler take the program you write and translate it into something that a machine can understand and execute?

As it turns out, many of the concepts and methods found in any compiler can be applied to other applications. A compiler is just an application that translates something (your code) written in one language to its equivalent in another. Just as the GNU Compiler Collection’s C compiler (gcc) translates C code into machine-readable code, you can think of a Web browser as a compiler that translates HTML into a human-readable page.

All compilers, independent of input and output format, work similarly. Figure One shows a diagram of the seven steps a compiler performs.

Figure One: The compilation process

STEP ONE: Lexical analysis translates a stream of characters (the source code) into a stream of tokens. A token is a sequence of characters that represents a single building block of a programming language — a variable, a keyword, or even just a semicolon (a very important part of C and other languages).

STEP TWO: Syntax analysis determines if the source code (now represented by tokens) conforms to the syntax rules of the programming language. For example, a syntax rule might state that a variable must begin with a letter.

STEP THREE: Semantic analysis judges whether the source code adheres to the semantic rules of the programming language. A rule, for example, might mandate that an integer and an array cannot be multiplied together with the * operator.

STEP FOUR: Intermediate code generation translates an abstract representation of the program to a form that is easy to manipulate, analyze, and generate other code from.

STEP FIVE: During intermediate code optimization, the compiler improves the efficiency of the intermediate code by applying non-machine-specific optimizations.

STEP SIX: The object code generation step generates object code for a specific processor architecture.

STEP SEVEN: Object code optimization optimizes object code based on machine-specific nuances.

This month we’ll examine the “front end” of the compiler (consisting of steps 1 through 4). Next month we will focus on the “back end” of the compiler (steps 5 through 7).

Step One: Lexical Analysis

The first step in the compilation process is lexical analysis. In this phase, the compiler splits up the ASCII representation of the program into tokens. As mentioned earlier, a token is a sequence of characters that represents a single building block of a program. Tokens in C might include +, -, ->, int, and foo. The first three tokens are operators (two arithmetic and a pointer dereference), the fourth is a keyword, and the last is an identifier (used to name a function or variable).

Let’s look at the C source code in Figure Two. The text in this code would be translated into the list of tokens shown in Figure Three.




Figure Two: Example code


int main ()
{
    float foo, bar, baz;
    foo = bar + baz;
}




Figure Three: Lexical analysis


Token      Token Type
'int'      TYPE_KEYWORD
'main'     IDENTIFIER
'('        LEFT_PAREN
')'        RIGHT_PAREN
'{'        LEFT_CURLY
'float'    TYPE_KEYWORD
'foo'      IDENTIFIER
','        COMMA
'bar'      IDENTIFIER
','        COMMA
'baz'      IDENTIFIER
';'        SEMICOLON
'foo'      IDENTIFIER
'='        OPERATOR
'bar'      IDENTIFIER
'+'        OPERATOR
'baz'      IDENTIFIER
';'        SEMICOLON
'}'        RIGHT_CURLY

White space is ignored, as it is unimportant in C. Lexical analysis does not perform any error checking. The code snippet int int int foo ( (; is translated into the series of tokens TYPE_KEYWORD, TYPE_KEYWORD, TYPE_KEYWORD, IDENTIFIER, LEFT_PAREN, LEFT_PAREN, SEMICOLON, since each of the tokens is valid. To catch errors like this one, the compiler relies on the next step of the compilation process, syntax analysis.
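To make this concrete, here is a minimal lexer sketch in C. It recognizes only what Figure Three needs: identifiers, the int and float keywords, and single characters. As a simplification invented for this article, it lumps all single-character tokens into one catch-all PUNCTUATION category rather than the distinct types shown in Figure Three.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *src;   /* cursor into the source text */

static void next_token(void)
{
    while (isspace((unsigned char)*src))   /* white space is discarded */
        src++;
    if (*src == '\0')
        return;

    if (isalpha((unsigned char)*src) || *src == '_') {
        const char *start = src;
        while (isalnum((unsigned char)*src) || *src == '_')
            src++;
        int len = (int)(src - start);
        /* a keyword is just an identifier the lexer knows by name */
        if ((len == 3 && !strncmp(start, "int", 3)) ||
            (len == 5 && !strncmp(start, "float", 5)))
            printf("'%.*s'  TYPE_KEYWORD\n", len, start);
        else
            printf("'%.*s'  IDENTIFIER\n", len, start);
    } else {
        /* everything else -- ( ) { } ; , = + -- is one character */
        printf("'%c'  PUNCTUATION\n", *src++);
    }
}

int main(void)
{
    src = "int main () { float foo, bar, baz; foo = bar + baz; }";
    while (*src != '\0')
        next_token();
    return 0;
}

Running this on the code from Figure Two prints one token per line, in the same order as Figure Three.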

Step Two: Syntax Analysis

In the syntax analysis phase, the compiler determines if the code (now represented as a stream of tokens) conforms to the syntactic rules of the programming language. Usually, these syntax rules are expressed in a special shorthand called a context-free grammar (CFG). CFGs explicitly define the various constructs of a programming language and how they should be formed.

Figure Four shows a partial example of what the CFG for functions in C might look like. (This is by no means a complete definition of functions in C.) CFGs have terminals and non-terminals. Terminals (shown as uppercase words in Figure Four) correspond to actual tokens. Non-terminals (shown as lowercase words) are rules that can be expanded. Eventually, after expanding all non-terminals, you are left with a sequence of terminals. A complete CFG can be used to check that any given piece of source code conforms to the structure of the language.




Figure Four: A partial context-free grammar for C


Note: A | B means that a non-terminal can be expanded to either A or B.

function: type IDENTIFIER LEFT_PAREN optional_arguments RIGHT_PAREN LEFT_CURLY statements RIGHT_CURLY

type: TYPE_KEYWORD | user_defined_type

optional_arguments: null | argument | argument COMMA optional_arguments

statements: statement | statement statements

statement: declaration | equals_expr | …

declaration: type IDENTIFIER optional_declarations SEMICOLON

equals_expr: IDENTIFIER EQUALS_SIGN expression

expression: plus_expr | equals_expr | …

plus_expr: expression PLUS_SIGN expression

Given the example program from Figure Two, and the CFG from Figure Four, the compiler’s syntax analysis phase determines that the code is a legal function definition. Table One shows how the CFG definitions match up to the tokens derived from the lexical analysis phase. All of the non-terminals match up.




Table One: Syntactical analysis


CFG Definition       Token Type      Token
type                 TYPE_KEYWORD    'int'
IDENTIFIER           IDENTIFIER      'main'
LEFT_PAREN           LEFT_PAREN      '('
optional_arguments   (nothing)       (nothing)
RIGHT_PAREN          RIGHT_PAREN     ')'
LEFT_CURLY           LEFT_CURLY      '{'
statements           TYPE_KEYWORD    'float'
                     IDENTIFIER      'foo'
                     COMMA           ','
                     IDENTIFIER      'bar'
                     COMMA           ','
                     IDENTIFIER      'baz'
                     SEMICOLON       ';'
                     IDENTIFIER      'foo'
                     OPERATOR        '='
                     IDENTIFIER      'bar'
                     OPERATOR        '+'
                     IDENTIFIER      'baz'
                     SEMICOLON       ';'
RIGHT_CURLY          RIGHT_CURLY     '}'
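To see how a compiler might actually match tokens against grammar rules like these, consider the following sketch of a hand-written recursive-descent parser, where each grammar rule becomes a function. It handles just enough of Figure Four's grammar to accept the token stream from Figure Three; the hard-coded token array and the simplified rules (arguments and richer expressions are omitted) are inventions for this article, not how gcc's parser works.

#include <stdio.h>
#include <stdlib.h>

enum tok { TYPE_KEYWORD, IDENTIFIER, LEFT_PAREN, RIGHT_PAREN,
           LEFT_CURLY, RIGHT_CURLY, SEMICOLON, COMMA, OPERATOR, END };

/* tokens for "int main () { float foo, bar, baz; foo = bar + baz; }" */
static enum tok input[] = {
    TYPE_KEYWORD, IDENTIFIER, LEFT_PAREN, RIGHT_PAREN, LEFT_CURLY,
    TYPE_KEYWORD, IDENTIFIER, COMMA, IDENTIFIER, COMMA, IDENTIFIER,
    SEMICOLON, IDENTIFIER, OPERATOR, IDENTIFIER, OPERATOR, IDENTIFIER,
    SEMICOLON, RIGHT_CURLY, END
};
static int pos;

static enum tok peek(void) { return input[pos]; }

static void expect(enum tok t)
{
    if (input[pos++] != t) {
        fprintf(stderr, "syntax error at token %d\n", pos - 1);
        exit(1);
    }
}

/* statement: declaration | equals_expr (heavily simplified) */
static void parse_statement(void)
{
    if (peek() == TYPE_KEYWORD) {            /* declaration */
        expect(TYPE_KEYWORD);
        expect(IDENTIFIER);
        while (peek() == COMMA) { expect(COMMA); expect(IDENTIFIER); }
    } else {                                 /* expression statement */
        expect(IDENTIFIER);
        while (peek() == OPERATOR) { expect(OPERATOR); expect(IDENTIFIER); }
    }
    expect(SEMICOLON);
}

/* function: type IDENTIFIER ( ) { statements } */
static void parse_function(void)
{
    expect(TYPE_KEYWORD);
    expect(IDENTIFIER);
    expect(LEFT_PAREN);
    expect(RIGHT_PAREN);   /* optional_arguments omitted for brevity */
    expect(LEFT_CURLY);
    while (peek() != RIGHT_CURLY)
        parse_statement();
    expect(RIGHT_CURLY);
}

int main(void)
{
    parse_function();
    expect(END);
    puts("parse OK: input matches the function rule");
    return 0;
}

If a token ever fails to match what the current grammar rule demands, the parser reports a syntax error, which is exactly where messages like "parse error before ';'" come from.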

Step Three: Semantic Analysis

Syntax analysis catches syntax errors such as typos and invalid constructions like illegal variable names. However, it is possible to write programs that are syntactically correct but still violate the rules of the language. For example, the following conforms to the syntax of C, but is invalid nonetheless, since an integer and an array cannot be multiplied:


int a[10];
int b;
b = a * b;

The code is syntactically correct because the CFG definition for a multiplication operation might look something like:


multiply: IDENTIFIER STAR IDENTIFIER

A CFG cannot validate that identifiers agree in some way, nor can it reflect that some kinds of identifiers (i.e., integers and arrays) are simply not allowed to be multiplied. Constraints that dictate how variables, statements, and structures (and everything else) can be used in the language are specified with usage rules, or semantic rules. Semantic analysis makes sure that code conforms to the rules of the language that cannot be enforced by syntax alone.

For example, in C, the semantic analysis phase performs type checking. The C compiler’s semantic analyzer enforces rules that require calls to functions to have correct argument types and the correct number of arguments. It also rejects variables that are used before they are defined. If your code violates any of the fundamental rules of C, the compiler flags the error(s) and stops.
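The heart of such a check is small. The sketch below shows the kind of decision a semantic analyzer makes for the * operator. The type enumeration and helper function are made up for this article; a real compiler would consult a symbol table built during parsing to find each operand's type.

#include <stdio.h>

enum type { TYPE_INT, TYPE_FLOAT, TYPE_ARRAY };

/* returns 1 if lhs * rhs is legal, 0 otherwise */
static int check_multiply(enum type lhs, enum type rhs)
{
    /* arithmetic types may be multiplied; arrays may not */
    return lhs != TYPE_ARRAY && rhs != TYPE_ARRAY;
}

int main(void)
{
    printf("int * int   -> %s\n",
           check_multiply(TYPE_INT, TYPE_INT) ? "ok" : "error");
    printf("array * int -> %s\n",
           check_multiply(TYPE_ARRAY, TYPE_INT) ? "ok" : "error");
    return 0;
}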

Abstract Syntax Trees

To perform semantic analysis, the compiler must have additional information. For example, to analyze whether a multiplication operation is valid, the compiler must know the types of the two operands. Most compilers collect this contextual information during syntax analysis and refer to it during semantic analysis.

The compiler records this information internally in what’s called an abstract syntax tree (AST). An AST represents the code in a way that the compiler can easily traverse and manipulate as it performs the latter phases of compilation.

Figure Five: An AST in its tree form

Consider the small code snippet bar + baz. Figure Five and Figure Six present two different, but logically equivalent, representations of the AST of the code. Figure Five shows a traditional tree. Figure Six is more concrete: it’s gcc’s internal realization of the same AST. (The publicly available gcc does not have a command-line option or a built-in capability to dump its internal AST structures. The output presented here was emitted from a version of gcc created just for this article.) Both representations of an abstract syntax tree capture the intent of the code: adding the variables bar and baz.




Figure Six: An AST inside gcc


PLUS_EXPR +
  VAR_DECL bar
    IDENTIFIER_NODE
    NULL_TREE
  VAR_DECL baz
    IDENTIFIER_NODE
    NULL_TREE

You might recognize both of the ASTs as forms of Polish Notation. Polish Notation places operators before operands (“* + 4 5 6” is the equivalent of the infix expression “(4 + 5) * 6”) and allows mathematical expressions to be specified and evaluated without the need for parentheses. Compilers often translate source code into Polish Notation because the shorthand is easily represented as a tree and trees are easily traversed and analyzed.
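As a rough illustration of why trees and prefix notation go hand in hand, here is a small, self-contained sketch of an AST node type and a prefix-order traversal. The struct layout is invented for this article and is far simpler than gcc's actual tree nodes, but the traversal idea is the same.

#include <stdio.h>

struct ast {
    const char *label;        /* operator or variable name */
    struct ast *left, *right; /* NULL for leaves */
};

/* print the tree in prefix order: operator before operands */
static void print_prefix(const struct ast *n)
{
    if (n == NULL)
        return;
    printf("%s ", n->label);
    print_prefix(n->left);
    print_prefix(n->right);
}

int main(void)
{
    /* the AST for "bar + baz" from Figure Five */
    struct ast bar  = { "bar", NULL, NULL };
    struct ast baz  = { "baz", NULL, NULL };
    struct ast plus = { "+", &bar, &baz };

    print_prefix(&plus);   /* prints: + bar baz */
    putchar('\n');
    return 0;
}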

Figure Seven contains the AST that gcc would generate for the expression a * (b + c). And finally, Figure Eight contains the AST that gcc would generate from the code listed in Figure Two.




Figure Seven: gcc’s AST for a * (b + c)


MULT_EXPR
  VAR_DECL a
    IDENTIFIER_NODE
    NULL_TREE
  PLUS_EXPR
    VAR_DECL b
      IDENTIFIER_NODE
      NULL_TREE
    VAR_DECL c
      IDENTIFIER_NODE
      NULL_TREE




Figure Eight: The AST for code from Figure Two


FUNCTION_DECL main
  IDENTIFIER_NODE
  FUNCTION_TYPE
    INTEGER_TYPE int
  RESULT_DECL
  NULL_TREE
  NULL_TREE
  COMPOUND_STMT
    SCOPE_STMT
    DECL_STMT foo
      VAR_DECL foo
        IDENTIFIER_NODE
        NULL_TREE
    DECL_STMT bar
      VAR_DECL bar
        IDENTIFIER_NODE
        NULL_TREE
    DECL_STMT baz
      VAR_DECL baz
        IDENTIFIER_NODE
        NULL_TREE
    EXPR_STMT
      MODIFY_EXPR =
        VAR_DECL foo
          IDENTIFIER_NODE
          NULL_TREE
        PLUS_EXPR +
          VAR_DECL bar
            IDENTIFIER_NODE
            NULL_TREE
          VAR_DECL baz
            IDENTIFIER_NODE
            NULL_TREE
    SCOPE_STMT

If your code passes successfully through lexical, syntactical, and semantic analysis, it’s valid. Time for step four.

Step Four: Intermediate Code Generation

By this point, your source code has been translated into a very large AST. During intermediate code generation, the AST is converted to an intermediate form that is both programming-language-independent and machine-independent. Think of the intermediate form as a generic programming language. The front end of the compiler processes a specific input language and generates the intermediate form. The back end reads the intermediate form and generates a specific output language, typically machine code for a specific processor.

By avoiding specifics, translation (steps one through three) can be separated from generation (steps five through seven). After generating the intermediate form, the front-end work of the compiler is done.
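One common machine-independent intermediate form (a textbook style, not necessarily what gcc emits internally) is three-address code, where each instruction has at most one operator and results land in compiler-generated temporaries. The sketch below reuses the same toy AST node type from the earlier example to lower foo = bar + baz; the temporary naming scheme (t0, t1, ...) and instruction format are invented for illustration.

#include <stdio.h>

struct ast {
    const char *label;        /* operator or variable name */
    struct ast *left, *right; /* NULL for leaves */
};

static int next_temp;   /* counter for t0, t1, ... */

/* Emits three-address instructions for the subtree and returns the
 * name (variable or temporary) that holds its value. */
static const char *gen(const struct ast *n, char *buf)
{
    char lbuf[16], rbuf[16];

    if (n->left == NULL && n->right == NULL)
        return n->label;              /* a variable names its own value */

    const char *l = gen(n->left, lbuf);
    const char *r = gen(n->right, rbuf);
    sprintf(buf, "t%d", next_temp++);
    printf("%s = %s %s %s\n", buf, l, n->label, r);
    return buf;
}

int main(void)
{
    /* the right-hand side of "foo = bar + baz" from Figure Two */
    struct ast bar  = { "bar", NULL, NULL };
    struct ast baz  = { "baz", NULL, NULL };
    struct ast plus = { "+", &bar, &baz };

    char buf[16];
    printf("foo = %s\n", gen(&plus, buf));   /* t0 = bar + baz; foo = t0 */
    return 0;
}

Because every instruction in this form is so simple, the back end can analyze, rearrange, and translate it far more easily than it could the original source.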

Next month, we’ll take a look at the latter three steps in the compilation process — the so-called “back end” of the compiler — and see how a compiler applies both general and architecture-specific optimizations to maximize the performance of your code. In the meantime, happy hacking!



Benjamin Chelf is an author and engineer at CodeSourcery, LLC. He can be reached at chelf@codesourcery.com.
