GCC 4.0

GCC 4.0 has a new optimization infrastructure, and future versions of GCC will realize even more performance improvements. Here’s an overview of what’s new and what’s to come.

Version 4.0.0 of the GNU Compiler Collection (GCC) was released in April 2005. More than three years in the making — the last major revision, 3.x, was released in 2001 — 4.0.0 was a substantial upgrade of one of the most ubiquitous software development tools around.

During the last year or so, the GCC development team has produced a series of maintenance releases and has continued to work on new features for 4.1. But beyond the usual grind of adding enhancements and tweaking GCC's extensive code base, the team is also looking to take GCC, the de facto standard compiler on Linux and many other free operating systems, to the next level. Even the best can get better.

The biggest advance in GCC 4.x is the inclusion of a new optimization framework dubbed Tree SSA. To understand that term and realize why it’s such an important improvement, you first need to understand a little bit about GCC’s internal infrastructure.

GCC (like most compilers) begins by parsing and analyzing your source code. At the end of this phase, GCC has a high-level internal representation of the program that clearly expresses the semantics of the input program. For example, if the input program was the succinct…

extern int f(double);

int g(int i) {
    return f(i);
}

… the high-level representation for the return statement looks (conceptually) like Figure One. Since the representation in Figure One looks like an (upside-down) tree, this representation is aptly called "Tree."

FIGURE ONE: A high-level representation of a code fragment

One subtle aspect of the transformation from code to Tree can be seen by looking at the handling of i in the call to f(). The semantics of the C programming language dictate that the integer argument i should be implicitly converted to a double when calling f(). Similarly in C, identifiers that name functions (like “f”) implicitly “decay” to pointers. That implicit decay is also made explicit in the Tree representation.
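Conceptually (this is an illustration in ordinary C, not GCC's actual internal form, and the pointer variable fp is just a name chosen for the example), the front-end behaves as if g() had been written with those implicit operations spelled out:

extern int f(double);

int g(int i) {
    /* The function name "f" decays to a pointer, and the int argument
       is explicitly converted to double before the call. */
    int (*fp)(double) = &f;
    return (*fp)((double) i);
}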

In C, there are relatively few such implicit operations, but other languages have many more. For example, in C++, the compiler must instantiate templates, determine which of several overloaded functions is being called, generate code for virtual function calls, and generate calls to constructor and destructor functions when variables are declared. The front-end of the compiler is responsible for making all implicit operations explicit.

At the same time, there are some low-level, machine-specific operations that remain implicit, even after the Tree representation has been generated. For example, in the Tree in Figure One, there’s still a call node whose children are the address of the function to call and the arguments to the function. On real machines, making a subroutine call is a complex process that involves placing the arguments in appropriate registers or on the stack, executing the actual call instructions, and then retrieving the result. All of that low-level detail is still implicit at the Tree level.

Put simply, the Tree represents the program that would be obtained by translating the original program into C and making all operations explicit. Thus, to a first approximation, after a Tree has been generated, the rest of the compiler doesn't care what programming language the original source code was written in.

The RTL Representation

The Tree representation is then further "devolved" into a machine-specific representation called the register transfer language (RTL). Each instruction in the RTL representation corresponds directly to a machine instruction. As a consequence, while the Tree representation for the same program might look very similar on two different machines, the RTL representations for two distinct kinds of machines probably look very different. Many details that were implicit at the Tree level — such as the precise registers in which arguments are passed, the kinds of branch instructions that are available, and whether or not there are hardware instructions to support floating-point operations — are explicit at the RTL level.

As a final step, the compiler generates assembly language by iterating over the RTL representation. That step is actually simple because the RTL representation corresponds directly to assembly code. You can think of this last step as converting from English written in the common Roman alphabet to English written in Braille.


Optimization

If you've ever looked at the marketing materials for a compiler (including GCC's own web pages), you've probably seen references to optimization, the process by which a compiler tries to generate the best code it can.

For every input program, there are many different sequences of assembly code that correctly perform the computation the developer wants. The goal of an optimizing compiler is to generate the best of them, for some appropriate definition of "best."

The most common definition of "best" is "the assembly code that runs fastest." However, on embedded systems, programmers sometimes want the assembly code that's smallest, requiring the least amount of ROM. When you use the -O2 option to GCC, you're requesting that the compiler generate code that runs as fast as possible without taking too long to compile; the -Os option instead asks for the smallest code.

Unfortunately, no compiler can do a perfect job of optimization. (Indeed, there are mathematical theorems that prove this fact!) So, compiler developers do the best they can, aiming to do a very good job on most input programs and not too badly on any program.

The translation from source to assembly language described above omitted any optimization phase. In fact, optimization occurs throughout the entire compilation process. A little bit of optimization happens during the very first phase, even as the source code is translated to a Tree. For example, an expression involving only constants (such as 2+3) is simplified into 5. Or, in C++, the compiler sometimes omits calls to copy constructors when the language specification permits that to occur. These optimizations, while significant, are the simplest optimizations performed by the compiler; optimization experts think of these sorts of optimizations as relatively uninteresting.
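For instance, a constant expression like the one below is folded while the Tree is being built (a contrived example; the function name is mine):

int five(void) {
    /* The front-end folds 2 + 3 to 5 as it builds the Tree, so no
       addition survives into the later phases of the compiler. */
    return 2 + 3;
}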

In all versions of GCC before GCC 4.0.0, all "interesting" optimizations occurred in RTL. First, the compiler would perform high-level optimizations, such as moving a computation out of a loop if that computation produced the same result every time. Then, the compiler performed low-level optimizations, including instruction scheduling, in which a preselected sequence of machine instructions is reordered so that it runs as fast as possible on a particular processor.
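The first of those, often called loop-invariant code motion, conceptually rewrites a loop like the one below (a hand-written before-and-after sketch with made-up function names, not compiler output):

void scale(double *a, int n, double x, double y) {
    int i;
    for (i = 0; i < n; i++)
        a[i] = a[i] * (x * y);   /* x * y has the same value on every iteration */
}

/* ...as if the invariant computation had been hoisted out of the loop: */
void scale_hoisted(double *a, int n, double x, double y) {
    double t = x * y;            /* computed once, before the loop */
    int i;
    for (i = 0; i < n; i++)
        a[i] = a[i] * t;
}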

In GCC 4.0.0, many high-level optimizations are now performed at the Tree level rather than at the RTL level. The next section explains why.

Reasons to Change

Over the years, the GCC developers realized that it was impractical or impossible to do as much optimization as was desired on the RTL representation. There is not enough room in this article to describe all of the problems, but some basic examples are illustrative.

The most fundamental data structure used for optimization is the control flow graph (CFG). The CFG is a directed graph in which each node is a "basic block": a section of code with only one entry point and only one exit point. In other words, once you start executing the first instruction in the block, you are guaranteed to execute every instruction in the block. However, there may be many ways to get to the beginning of the block, and many places to go after you leave it. There is an edge from one block to another if the second block can be reached directly after leaving the first.

For example, the CFG for the code snippet…

if (j != 7) {
    i = 1;
} else {
    i = 2;
}

printf("%d", i + j);

… looks like Figure Two. In the figure, there are two successor blocks to the j!=7 block, and two predecessor blocks to the printf block. Each “branch” of the conditional has only one predecessor block and only one successor block. The CFG data structure is incredibly powerful, because the compiler can use it to figure out how to move computations around and how to simplify them.

FIGURE TWO: A sample control flow graph (CFG) for an if-then-else clause

For example, if the compiler can see that no matter how you reach a particular block, the variable i has the value 3, the compiler can replace i with 3 in that block, even if it is possible for i to have other values in other blocks. Replacing a variable with a constant may then permit further simplifications.
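In C terms, that constant-propagation idea looks something like this (a made-up fragment; do_this() and do_that() are just stand-ins for arbitrary code):

extern void do_this(void);
extern void do_that(void);

int constant_example(int flag, int j) {
    int i;
    if (flag) {
        i = 3;
        do_this();
    } else {
        i = 3;
        do_that();
    }
    /* Every path into this point assigns i the value 3, so the compiler
       can treat the return as "return 3 + j;". */
    return i + j;
}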

Or, if there is no way to reach a particular block, then that code can be eliminated completely. Or, as another example, if the same computation is performed in two blocks and every path that reaches either one of the blocks goes through some predecessor block, then the computation can be moved to the predecessor block, an optimization that shrinks the size of the program. And there are many more optimizations that can be performed on a control flow graph.
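That last transformation, hoisting a computation out of two blocks into their common predecessor, might look like this in C (again a hand-written before-and-after sketch, not compiler output):

int before_hoisting(int cond, int a, int b) {
    int x;
    if (cond)
        x = a * b + 1;   /* a * b computed here... */
    else
        x = a * b - 1;   /* ...and again here */
    return x;
}

int after_hoisting(int cond, int a, int b) {
    int t = a * b;       /* moved into the common predecessor block */
    int x;
    if (cond)
        x = t + 1;
    else
        x = t - 1;
    return x;
}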

Unfortunately, building an accurate control flow graph at the RTL level is somewhat difficult. For example, at the assembly code level, you sometimes see “conditionally executed” instructions which, as their name implies, may or may not actually be executed depending on the state of some predicate register. The prologue code for functions in shared libraries often contains code that appears to be a control-flow transfer, but which is not really such a transfer. These kinds of details reduce the accuracy of the control flow graph. It’s possible to build a CFG for RTL, and GCC does in fact do so, but it’s difficult.

A fundamental tenet of compiler design is that the more the compiler knows, the better the code it can generate. Every single fact that the compiler can deduce about the program can result in better code. But a lot of information is lost in the translation to the RTL level. Therefore, doing optimizations on RTL necessarily results in inferior code.

In particular, the RTL representation has no notion of "type," in the sense that programmers usually use that term. At the RTL level, the only types are things like "16-bit integer" or "32-bit floating-point value." There's no easy way to even tell whether a particular value is a signed integer or an unsigned integer, and there's no way to tell that three consecutive 32-bit floating-point values stored in memory are actually a structure representing a point in three-dimensional Euclidean space. For the underlying processor, it's just data, and the same is true for RTL. (Processors do not know that a particular piece of data is signed or unsigned; the assembly code simply tells them whether to perform a signed or unsigned operation on the data.)
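One concrete consequence: knowing whether a value is signed or unsigned lets the compiler pick cheaper instruction sequences. A small illustration (my own example, not taken from GCC's documentation):

unsigned int half_unsigned(unsigned int x) {
    return x / 2;   /* for an unsigned value, this is a single logical right shift */
}

int half_signed(int x) {
    return x / 2;   /* for a signed value, a plain arithmetic shift would round
                       negative numbers the wrong way, so the compiler must emit
                       extra instructions to correct the result */
}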

Type information is just one kind of information that is unavailable at the RTL level, despite being available in the source program and useful for optimization. Another example is knowledge about what memory is part of an array. At the RTL level, array accesses are transformed into pointer dereferences. As a result, it may no longer be obvious to the compiler that a series of pointer dereferences actually refer to the same array, which again results in inferior code.
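A sketch of that lowering (hand-written; real RTL is machine-specific and looks nothing like C):

double sum(double a[100]) {
    double s = 0.0;
    int i;
    for (i = 0; i < 100; i++)
        s += a[i];        /* plainly reads successive elements of the array a */
    return s;
}

/* After lowering to pointer arithmetic, each access is just a dereference
   of a computed address, and the fact that every access lands inside the
   same array is no longer explicit: */
double sum_lowered(double a[100]) {
    double s = 0.0;
    double *p;
    for (p = a; p != a + 100; p++)
        s += *p;
    return s;
}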

Another problem is that by the time RTL has been generated, the compiler has committed to “code shape.” For example, the compiler may have decided that some structure variables must live in memory on the stack, because they are too big to fit in registers. Even if the compiler can subsequently prove that part of the variable is not used by the program, and that, therefore, the value could fit in a register, it’s very difficult to find and correct all the places in the function that assume that the variable will be on the stack.

Finally, it’s nearly impossible to convert the RTL representation into static single assignment (SSA) form. Most modern optimization algorithms work most efficiently and most effectively on programs in SSA form.

Static Single Assignment Form

A program is in static single assignment (SSA) form when each variable is assigned in exactly one location. The example above…

if (j != 7) {
    i = 1;
} else {
    i = 2;
}

printf("%d", i + j);

… is not in SSA form, because there are two assignments to i. To transform the program into SSA form, you need to somehow arrange that there is only one assignment to i.

To accomplish that, you must introduce a magical construct called the phi operator. The new version of the program, shown as a CFG, appears in Figure Three.

FIGURE THREE: Using the phi operator to transform Figure Two into static single assignment (SSA) form.

The phi operator assigns a different value to i2 depending on which edge was used to reach the bottom block. If control came from the block on the left, the value in i0 is assigned to i2; otherwise, control came from the block on the right, and the value of i1 is assigned to i2.
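SSA renaming is easiest to see in straight-line code; the phi operator only becomes necessary where control paths join. A rough sketch (the names i_0, i_1, and so on are mine, and PHI is pseudo-code, not real C):

/* Straight-line code goes into SSA form just by renaming: */
int renamed(void) {
    int i_0, i_1, i_2;
    i_0 = 4;            /* originally: i = 4;     */
    i_1 = i_0 + 1;      /* originally: i = i + 1; */
    i_2 = i_1 * 2;      /* originally: i = i * 2; */
    return i_2;
}

/* With branches, renaming alone is not enough; the join point needs a
 * phi operator, which has no direct C equivalent:
 *
 *     if (j_0 != 7)  i_0 = 1;  else  i_1 = 2;
 *     i_2 = PHI(i_0, i_1);
 *     printf("%d", i_2 + j_0);
 */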

Of course, real hardware does not have a phi operator, so trying to construct an SSA form representation from a machine-level representation (like RTL) is very hard. There are additional technical difficulties, like the fact that at the machine level it’s sometimes hard to figure out what’s being assigned. For example, many machines have instructions that assign to only part of a register; that can make it very difficult to obey the SSA constraint that each register be assigned only once.

GCC 4.1 (which will probably be released in the first quarter of 2006) will take more advantage of the Tree-SSA form. Tree-SSA optimizations in development for GCC 4.2 have shown improvements of as much as 20% on popular benchmarks.

The basic goal for 4.0.x was to be no worse than previous releases. Though this may not seem a very ambitious goal, the work to move to Tree-SSA was approximately equivalent to trying to build a second story on a very old one-story house, replacing the foundation, reinforcing the walls, and bringing all systems up to code, without disturbing the tenants. In fact, most expectations were exceeded.

Improvements in GCC 4.0.x

There are a number of ways in which 4.0.x is a definite improvement over any previous release of GCC.

For example, GCC now performs an optimization called scalar replacement of aggregates, which is extremely important for C++ programs. In a program that uses small structures, like…

struct Point {
    double x, y, z;
};

… previous versions of GCC would have insisted on storing such a structure in memory and would have copied the structure from place to place whenever it was passed as an argument to functions. Now, GCC is capable of splitting the structure into separate "scalar" variables that can be handled independently. As a result, the contents of the structure are much more likely to be placed in registers and much less likely to be copied around in memory. This kind of improvement is known as "reducing the abstraction penalty," because it allows programmers to write shorter, more abstract programs without sacrificing performance.
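Conceptually, the effect is as if the structure parameter had been replaced by three independent scalar variables (a hand-written sketch; GCC performs this transformation internally, and the function names are mine):

struct Point {
    double x, y, z;
};

double length_squared(struct Point p) {
    return p.x * p.x + p.y * p.y + p.z * p.z;
}

/* After scalar replacement, the function behaves as if it had been written
   in terms of three separate scalars, each of which can live in a register: */
double length_squared_sra(double p_x, double p_y, double p_z) {
    return p_x * p_x + p_y * p_y + p_z * p_z;
}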

Another new optimization in GCC 4.0.x is partial redundancy elimination (PRE). This optimization looks for ways to move code into locations in the CFG that allow duplicate code appearing in multiple subsequent blocks to be eliminated. There are also a number of new optimizations that provide better optimization of loops. (Optimizing loops is very important, because most programs spend most of their time in relatively few loops. If you can make those loops more efficient, you make the whole program more efficient.)
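A classic example of a partially redundant expression, sketched by hand (the function names are made up, and GCC works on its internal representation rather than rewriting your source):

int before_pre(int cond, int a, int b) {
    int x = 0, y;
    if (cond)
        x = a + b;   /* a + b is computed only on this path... */
    y = a + b;       /* ...so this computation is redundant on that path
                        but not on the other: it is partially redundant */
    return x + y;
}

/* PRE inserts the computation on the path where it was missing, so the
   later occurrence can simply reuse the saved value: */
int after_pre(int cond, int a, int b) {
    int x = 0, y, t;
    if (cond) {
        t = a + b;
        x = t;
    } else {
        t = a + b;
    }
    y = t;
    return x + y;
}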

You may also be asking, “How much faster will my program run with GCC 4.0.x compared to older versions of GCC? Is it worth it for me to upgrade?”

It’s very difficult to answer this question. For starters, many programs spend most of their time accessing the disk or waiting for network packets to arrive. There’s nothing a compiler can do to make such programs run faster. Even for programs that are “CPU bound”, the improvements seen by upgrading to GCC 4.x will vary widely, depending on the nature of the program, the processor on which the program runs, and the command-line options you use when invoking the compiler. Unfortunately, there’s no hard-and-fast rule about how to get the best performance out of the compiler. If you’re trying to squeeze the maximum possible performance out of the compiler, you will have to do some experimentation.

As a rule of thumb, CPU bound programs might hope to see as much as a 10 percent improvement in performance when using GCC 4.x.

Future Directions

In some ways, the GCC 4.0.x release is an intermediate point: the new infrastructure is in place, but it is not being used as fully as it will be in the future. Already, GCC 4.1.x contains a number of new optimizations that will doubtless improve performance further.

None of the GCC development team can say for certain what the future holds for GCC. Fundamentally, the GCC "community" remains a loose-knit collection of people and organizations who all have different objectives, and those objectives are subject to change as interests shift and as business needs change. However, it's likely that a major focus of GCC in the coming few years will remain the performance of the generated code.

It’s also likely that many CPU and/or operating system manufacturers will continue to invest in proprietary compilers. Certainly, generating revenue is one of the motivations. Supporting an installed base accustomed to the particular behavior of their compilers is another. However, I do not believe either of these reasons is the overriding factor.

There are certainly technical faults in GCC. For example, while there have been many improvements made to the G++ front end and its language conformance is now very good, there is still a long list of bugs to fix and a few missing features to implement. The error messages could also be more descriptive.

There are weaknesses in other aspects of GCC as well. For instance, there are still input programs that crash the compiler, and cases where the compiler generates incorrect code for particular architectures. Making GCC bullet-proof would be a huge win for some parts of the developer community.

Ultimately though, performance remains the fundamental reason that organizations and individuals still invest in proprietary compilers. Many compiler vendors believe that because their compiler doesn't have to be portable to many architectures, it can be specialized for a particular architecture and therefore generate better code. If GCC consistently generated code that ran markedly faster than the code generated by a competing proprietary compiler on a particular platform, those companies would find it harder to justify continued investment in their proprietary code bases.

That day will come within the next few years — ten at the outside.

Mark Mitchell is the Chief Sourcerer of CodeSourcery, LLC, and has served as the GCC Release Manager since GCC 3.0. He can be reached at mark@codesourcery.com.
