Compiling and Linking: Under the Hood

These days, interpreted languages, most notably Perl, JavaScript, and Python, have made the barriers to entry for aspiring programmers a lot lower than they once were. Perl, in particular, makes it easy for a newcomer to get his or her feet wet and leave the deeper mysteries that make for industrial-strength, high-performance software for later. Languages such as C and C++ that typically get compiled all the way down to real machine code are a different story, however. These languages, designed by professional software engineers for professional software engineers, generally assume that you, the programmer, are able to get down to the gritty details (and often idiosyncratic quirks) of the underlying hardware and the software development tool set you will be using.

This is where lots of people throw in the towel, because they have neither the time nor the patience for this grotesque exercise, especially the often thinly-documented basics of C and C++ software development tool sets.

The GNU Toolchain

Those who have been mystified by the reasonably well-hidden wheels and pulleys that lurk behind that deceptively simple gcc foo.c command should read on.








Figure One: The GNU software development toolchain.

Figure One shows the flow through the GNU software development toolchain. Starting on the left is the C or C++ source code. To the right are the various transformations.

These days, there are relatively few people using the Objective C, Fortran, Modula, or Chill languages (although each of these does have its own GNU compiler and its own set of devotees), so from this point on we’ll focus only on compiling C or C++ programs using either the gcc or g++ compilers, respectively.

Compilers of the GNU family convert high-level source code into assembly code, which in turn is converted into machine code by an assembler and linker. The GNU assembler (gas, which can be invoked with the as command) and the GNU linker (ld) are used by default with the GNU compilers, and each understands the unique quirks and conventions of the others. You could substitute your own assembler and/or linker for the GNU ones if you really had a compelling reason to do so, but unless you are compiling for some obscure or novel CPU, you will never really find any good reason to do that.

It’s easy to get the feel of what assembly code is all about. Just write yourself a little “Hello world!” program in your favorite higher-level language (which we will assume is C) and then compile it with the special -S option that tells GCC (and also most other common Unix-hosted compilers) to compile your source code only down to assembly language and then to stop processing at that point.

So, for example, take a look at the following code:


#include <stdio.h>

int main (void)
{
        printf ("Hello world!\n");
        return 0;
}

Save that in a text file, call the file hello.c, and issue the command:


% gcc -S hello.c

If all goes well, you’ll end up with another file in the same directory called hello.s (with the .s suffix, by tradition, indicating an assembly source file). See Listing One for the result, which will vary depending on your version of gcc and the operating system and hardware used.




Listing One: “Hello World” in Assembly Language


        .file   "hello.c"
        .version        "01.01"
gcc2_compiled.:
        .section        .rodata
.LC0:
        .string "Hello world!\n"
        .text
        .align 4
        .globl main
        .type   main,@function
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        subl    $12, %esp
        pushl   $.LC0
        call    printf
        addl    $16, %esp
        movl    $0, %eax
        leave
        ret
.Lfe1:
        .size   main,.Lfe1-main
        .ident  "GCC: (GNU) 2.96 20000731 (Red Hat Linux 7.1 2.96-98)"

There are three main points you should take note of here:


  1. The assembly language files produced by the GCC or G++ compilers proper are just ordinary human-readable text files.

  2. Inspecting these files can reveal a lot about how the GNU compilers convert high-level code down to assembly language code.

  3. Assembly language is, for the most part, just a symbolically rich version of machine code (which should be obvious once you know that addl and subl correspond directly to the 32-bit add and subtract instructions of the x86 instruction set).

The Assembler

Next, the assembler takes the program down another step in the development toolchain to a relocatable binary object file. Note that a relocatable object file is suitable for use only by the linker program, which is the final component of the toolchain.

If we wanted, we could directly invoke the GNU assembler to produce a relocatable object file via this command:


% as -o hello.o hello.s

The -o option and its following argument tell the assembler where to put the resulting relocatable object file. However, the preferred method for accomplishing this task is to invoke the GNU C or C++ compiler and then let the compiler invoke the assembler for you:


% gcc -c -o hello.o hello.s

In this command, the -c option tells the compiler that you want a relocatable object file as the output of the command. Note that the compiler is smart enough to infer from the “.s” extension that the input file (hello.s) is an assembly language file, as opposed to a C language source file. Using gcc is preferable to using as directly, as we’ll see later.

The Linker

In most cases, the relocatable object file that results from the compile and assemble steps isn’t quite ready for direct execution by the CPU, because most such object files will refer to functions that haven’t been defined at all in the corresponding C or C++ source files. In the code shown above, the printf function is such an unsatisfied external reference, and we need to get the (compiled) code for it from somewhere before our program will be complete and ready for direct execution by the CPU. It’s the job of the linker to satisfy the reference and produce a truly complete executable program.

Unsatisfied external references are satisfied (by the linker) from one of two sources: other “.o” files or library files. Library files are themselves really just collections of “.o” files with a little indexing attached to make searching easier.

Library files can either be supplied by the programmer, or by the operating system. (Typically, Linux distributions provide a very rich assortment of precompiled, prepackaged, system-supplied library files containing functions for everything from basic system services to complex graphics, mathematics, GUI interfaces, data communications protocols, and lots of other things.)

Whether a library is supplied by you or by the operating system, it may come in one or both of two very distinct flavors — “static” libraries and “shared” libraries.

The only real difference between these two flavors of libraries is that shared libraries contain relocatable object files that have been specially compiled so that they can be used by multiple, otherwise independent programs at the same time (without the code in them having to be fully reproduced in multiple places in main memory), whereas static libraries contain code modules that have been compiled in a more traditional, and potentially more memory-hungry, way. (Static libraries are a topic for another column.)

To people who commonly work on Unix (or on some Unix-like system such as Linux), shared libraries may often be referred to simply as .so files. People who spend more of their time working on Microsoft Windows systems, however, typically refer to these as .dll files. (Shared library files will be the topic of next month’s column.)

The final tool in the GNU software development toolchain (i.e., the GNU linker — sometimes referred to as GLD) performs the following three major functions:


  1. Examining the relocatable object file for unsatisfied external references.

  2. Searching the available system-supplied and user-supplied libraries for functions needed to satisfy unsatisfied external references.

  3. Linking the relocatable object file with copies of the external functions so that everything will work together as an integrated unit, the executable object file.

The vast majority of programmers will never need to know in any great detail the specifics of the third and final step of this linking process. It can be quite complex, but a simple analogy is the setup process for a high-end home audio or video system: lots of things have to be correctly connected to lots of other things to make the whole system work. In the case of linking object files and components from library files together to form a final and complete executable program, the fine details of this process are not terribly interesting, nor is any knowledge of them necessary for enjoying the final product.

This process is, superficially at least, quite simple. Having already used the compiler and assembler to create relocatable object files (called one.o, two.o, and three.o for this example), this command would produce a binary executable named final if there were no unresolved external references:


% ld -o final one.o two.o three.o

Just as with assembly, the preferred way to invoke the lower-level tools of the GNU software development tool set is to let the compiler do the job for you:


% gcc -o final one.o two.o three.o

Note that this is the same command with gcc substituted for ld, so what difference does it make?

When you invoke the lower-level tools (like as or ld) via the compiler programs (gcc or g++), the compilers pass some additional command-line options to the lower-level tools. You ordinarily don’t see them, but they are frequently both helpful and desirable. Specifically, when you use gcc to invoke the linker, gcc adds several command-line arguments, one of which is -lc. This causes the linker to search in the system-supplied standard C library for functions not supplied by your own source files. This option is so frequently needed that gcc passes it to the linker (along with all the command-line arguments we supplied) unless you use the -nostdlib option to force gcc not to.

The automatic passing of the -lc option is just the tip of the iceberg, however. The curious can see all the gory details by adding the -v option to the gcc or g++ command line (see Figure Two). This exercise provides an appreciation for the numerous complexities that the people who created the GNU software development tools had to contend with, and how much of this complexity they managed to hide behind the curtain, so to speak.




Figure Two: Verbose Output from gcc


% gcc -v -o hello.o hello.s
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs
gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-97.1)
as -V -Qy -o /tmp/cc4eJWCp.o hello.s
GNU assembler version 2.11.90.0.8 (i386-redhat-linux) using BFD version 2.11.90.0.8
/usr/lib/gcc-lib/i386-redhat-linux/2.96/collect2 -m elf_i386
    -dynamic-linker /lib/ld-linux.so.2 -o hello.o
    /usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crt1.o
    /usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crti.o
    /usr/lib/gcc-lib/i386-redhat-linux/2.96/crtbegin.o
    -L/usr/lib/gcc-lib/i386-redhat-linux/2.96
    -L/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../..
    /tmp/cc4eJWCp.o -lgcc -lc -lgcc
    /usr/lib/gcc-lib/i386-redhat-linux/2.96/crtend.o
    /usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crtn.o

Do Try This at Home

This column has given you a high-level overview (from about 30,000 feet) of the process of compiling, assembling, and linking using the GNU software development toolchain. There are more details where these came from, some of them annoying, and some of them quite helpful. Each of the tools referred to in this column (gcc, g++, gas, and ld) has many additional command-line options that may prove useful in various circumstances. They are described in the man pages, if you are lucky enough to be using a Linux distribution that comes complete with a good collection of man pages for the GNU software development tools. Alternatively, you might try the documentation for the binutils package, which includes gas and ld, at http://sources.redhat.com/binutils (look for the “Documentation” link, which was for version 2.10 as of this writing). For the GNU compilers, try http://www.gnu.org/software/gcc for documentation and further information.



Ronald F. Guilmette is a software engineer known for products to thwart junk email. He can be reached at rfg@monkeys.com.
