Don't waste unused CPU cycles; put them to work compiling software. Rod Smith shows you how to use distcc to harness the power of distributed computing to speed up your compile time.
Distributed computing is an increasingly important and popular idea. Most computers sit idle most of the time; even when you’re typing furiously in a desktop computer’s word processor, for example, that computer’s CPU will be mostly idle, assuming you’re not running other CPU-intensive tasks in the background. Enabling one computer to use another computer’s unused CPU cycles can be an effective way to speed up certain tasks. On a very large and very public scale, projects such as SETI@Home and Folding@Home do just this, using CPU time donated from countless desktop systems to perform CPU-intensive scientific computations.
Distributed computing need not be limited to big scientific projects, though; you can use this technique yourself to speed up some of your own computing tasks. One useful tool in this category is
distcc, a set of programs that enables the distribution of C, C++, Objective C, or Objective C++ compilation tasks to multiple computers. When you deploy
distcc on your own network, you can greatly reduce the time required to compile large programs. If you compile your own kernel or other programs frequently, or if you develop programs yourself in these languages, and if you’ve got access to even just a couple of networked computers,
distcc may save you time.
I’ve just outlined the main reason to use
distcc: It can save you time. The principle is simple enough: If you’ve got two computers with equal CPU speed,
distcc can split the job of compiling a program across those two computers, assigning each source code file to whichever computer finishes its last assigned file first. The theoretical best-case result with two computers is a halving of the time to compile a program, and the benefit grows as more computers are added to the compile farm. Of course, in practice,
distcc imposes some overhead in the form of network transfers and its own CPU demands, so you won’t see quite the theoretical maximum speed increase, but it can still be quite substantial.
distcc does have its drawbacks. It’s an extra system to configure, particularly if your environment is a mixed one:
distcc uses each computer’s local C compiler, so if your network environment is mixed in CPUs or OSes, you may need to install cross-compilers on some of your systems (a task that’s outside the scope of this column). The CPU load on the computers could conceivably come at an awkward time, too; your boss might not appreciate his computer slowing down when he’s preparing a presentation!
Despite these problems,
distcc can be a useful tool, particularly if you’ve got a few systems that are mostly idle and if you compile a lot of software yourself. Gentoo users are particularly likely to find
distcc useful, since most Gentoo packages are compiled locally.
Many Linux distributions today include distcc packages. Thus, you may be able to install distcc by using your package management system; for instance, by typing apt-get install distcc on a Debian or Ubuntu system, yum install distcc on RPM-based distributions that use yum, or emerge distcc on Gentoo. This should install both the tools you’ll use on the system that’s officially compiling the source code and the network server and associated tools needed on the distcc servers (that is, the computers that work at the behest of the main system).
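As a rough sketch of the choice described above, the following script prints the install command that matches whichever of the three package managers is present. The function name distcc_install_cmd is made up for this illustration, and the script only prints the command rather than running it, since installation requires root privileges.

```shell
#!/bin/sh
# Print the distcc install command for a given package manager.
# Only the three managers mentioned in the text are handled.
distcc_install_cmd() {
    case "$1" in
        apt-get) echo "apt-get install distcc" ;;
        yum)     echo "yum install distcc" ;;
        emerge)  echo "emerge distcc" ;;
        *)       echo "unknown package manager: $1" >&2; return 1 ;;
    esac
}

# Show the command for the first supported manager found on this system.
# Nothing is installed here; run the printed command as root yourself.
for pm in apt-get yum emerge; do
    if command -v "$pm" >/dev/null 2>&1; then
        distcc_install_cmd "$pm"
        break
    fi
done
```

If none of the three tools is found, the loop prints nothing, which is the cue to fall back on a manual installation.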
If your distribution lacks a distcc package, you can search for a replacement at sites such as RPM Find (http://www.rpmfind.net) or download the source code from the main distcc site, http://distcc.samba.org. You’ll then have to compile and install the package yourself, a task you should be capable of doing without further advice if you need distcc! I do have one non-obvious piece of advice, though: When you run the configure script, pass it the --with-gnome or --with-gtk option; either option will enable compilation of a GUI monitor tool that can be useful in tracking down problems, as I describe later.
Before proceeding further, you may want to check the C compiler versions on your computers. In theory,
distcc will work, and produce usable executables, when you use different compiler versions or even different compilers (GCC vs. ICC, for instance) on the various computers in your build farm. In practice, though, mixing different compiler versions, and especially different compiler packages, can cause problems.
This is particularly true when you compile C++ programs. If possible, you should upgrade when necessary to make sure all systems are running the same version of the compiler, or at least ensure that they’re as close as possible to each other. Although
distcc is supposed to work with Intel’s ICC compiler,
distcc is best tested with GCC, and
distcc has some ICC-specific limitations.
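One way to check for mismatches is to collect each machine's compiler version and compare the strings. The sketch below assumes GCC everywhere; versions_match is a hypothetical helper, and the version numbers in the demonstration are invented placeholders.

```shell
#!/bin/sh
# Succeed only if every argument is the same version string.
versions_match() {
    first=$1
    for v in "$@"; do
        [ "$v" = "$first" ] || return 1
    done
    return 0
}

# Demonstration with made-up version strings; the third one differs.
versions_match 4.1.2 4.1.2 4.1.1 || echo "compiler versions differ"

# In live use you might gather the strings like this (assumes ssh access
# to each host; the host names are only examples):
#   versions_match "$(gcc -dumpversion)" \
#                  "$(ssh kernighan gcc -dumpversion)" \
#                  "$(ssh ritchie gcc -dumpversion)"
```

Comparing the full version string is deliberately strict; you could relax it to the major and minor numbers if exact matches are impractical on your network.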
Note that you do not need to have all the libraries and header files for a project installed on all the computers in a compile farm; only the primary system needs these files. This system performs basic pre-processing on all the source code files and delivers the pre-processed files to the server systems, obviating the need for header files on the
distcc servers. Linking the object files into a final executable is performed entirely on the developer’s computer, so libraries need only be installed on that computer.
With distcc installed and your C compiler versions synchronized with each other, you should check all your systems to be sure each has a user named
distcc defined. This is the user that the
distcc server uses for compiling software, if possible. The
distcc user can be an ordinary user account, but it doesn’t need login privileges or a home directory.
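A quick way to perform this check on each system is sketched below. The account_exists helper is hypothetical, and the useradd command in the comment is only one possible way to create the account; its flags and the nologin path vary between distributions, so treat them as assumptions.

```shell
#!/bin/sh
# Report whether a local account with the given name exists.
account_exists() {
    id "$1" >/dev/null 2>&1
}

if account_exists distcc; then
    echo "distcc account present"
else
    # One possible way to create the account (run as root). The flags and
    # the nologin path are assumptions that differ between distributions:
    #   useradd --system --no-create-home --shell /sbin/nologin distcc
    echo "distcc account missing"
fi
```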
The distcc package should include a program called
distccd. This is the
distcc daemon, and it must be run on the
distcc server computers. Systems that will be initiating
distcc sessions may run this daemon, but don’t need to if other computers don’t need to access them. In the short run, the simplest way to run the daemon is to launch the program manually:
distccd --daemon
When run in this way,
distccd launches itself as a daemon and accepts connections from any system. If you want to restrict the systems to which
distccd responds, you may add the
--allow option. This is required for recent versions of distccd:
distccd --daemon --allow 192.168.1.0/24
This example restricts access to the 192.168.1.0/24 block of IP addresses. To allow access to a single computer, specify its IP address rather than a network block. You can include multiple
--allow parameters to allow access from multiple networks or clients.
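When several networks or clients need access, the command line can be assembled from a list. The following sketch builds it with one --allow per entry; distccd_cmd is a hypothetical helper, and the second address is an invented example of a single client.

```shell
#!/bin/sh
# Build a distccd command line with one --allow option per permitted
# network block or client address.
distccd_cmd() {
    cmd="distccd --daemon"
    for net in "$@"; do
        cmd="$cmd --allow $net"
    done
    echo "$cmd"
}

# One network block plus one single client (the client address is an example).
distccd_cmd 192.168.1.0/24 192.168.2.17
# prints: distccd --daemon --allow 192.168.1.0/24 --allow 192.168.2.17
```

The function only prints the command, so you can inspect it before running it on a server.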
In the long term, you may want to configure
distccd to run automatically. Most distribution-specific
distcc packages include an appropriate SysV startup script, such as /etc/init.d/distccd. I recommend you examine this script; its details differ from one package to another, including methods used to set defaults such as hosts that are allowed to connect to it. You might need to edit files in /etc/distcc, edit the /etc/default/distcc file, or edit other configuration files. You may need to edit these configuration files or make changes with distribution-specific SysV startup script editing tools, such as
system-config-services, in order to have the
distcc SysV startup script run automatically the next time you boot the computer.
distcc on the Master System
Preparing a computer to initiate
distcc-based compiles is a matter of setting environment variables or editing configuration files and then telling the computer to use
distcc rather than
gcc or some other compiler to do the work. Behind the scenes,
distcc will call
gcc or other compilers, but it will distribute at least some files to other computers when it does so.
The most important environment variable is
DISTCC_HOSTS, which specifies the computers you want to function as distcc servers:
export DISTCC_HOSTS="localhost kernighan ritchie stroustrup"
This line tells
distcc to use four computers to compile software: localhost, kernighan, ritchie, and
stroustrup. You should, of course, adjust the list of systems for your network; however, you’ll typically include
localhost in the list, and it should normally appear first;
distcc uses the systems in the specified order, so placing
localhost and your fastest computers first in the list can provide a modest speed boost. You can place this line (including the
export keyword, so that make and distcc inherit the variable) in your ~/.bashrc file to automate the process on subsequent logins.
As an alternative to defining the
DISTCC_HOSTS environment variable, you can specify a list of hosts in the ~/.distcc/hosts or
/etc/distcc/hosts file. This file should contain a list of hosts, all on one line, separated by spaces.
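A minimal sketch of creating the per-user hosts file follows. The write_distcc_hosts helper is hypothetical, and the demonstration writes into a scratch directory rather than your real home directory so that nothing existing is overwritten.

```shell
#!/bin/sh
# Write a distcc host list (all hosts on one line, separated by spaces)
# to the .distcc/hosts file under the given home directory.
write_distcc_hosts() {
    home_dir=$1
    shift
    mkdir -p "$home_dir/.distcc"
    echo "$*" > "$home_dir/.distcc/hosts"
}

# Demonstrate against a scratch directory so a real ~/.distcc is untouched.
scratch=$(mktemp -d)
write_distcc_hosts "$scratch" localhost kernighan ritchie stroustrup
cat "$scratch/.distcc/hosts"
# prints: localhost kernighan ritchie stroustrup
```

To write the real file, pass "$HOME" as the first argument instead of the scratch directory.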
Compiling a Program with distcc
Ideally, compiling a program using
distcc requires only a simple change to the
make command you normally issue. With most programs, you configure them (you’ll still need to do this locally) and then type make. With
distcc in place, instead of typing
make alone, you need to include the
-j num option to have
make fork off the specified number of parallel compile operations. You must also set the
CC environment variable to
distcc to have make call distcc as its compiler:
make -j 8 CC="distcc"
The number you pass using the
-j option should normally be about twice the number of CPUs you have available; however, you might want to experiment with different values to discover what works best with your configuration.
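The rule of thumb above is simple arithmetic, sketched here as a starting point for your own experiments. The suggest_jobs helper is hypothetical, and the farm of four single-CPU machines is an assumption chosen to match the four-host example used earlier.

```shell
#!/bin/sh
# Suggest a make -j value: about twice the number of CPUs in the farm.
suggest_jobs() {
    echo $(( 2 * $1 ))
}

# Assumed farm: four single-CPU machines.
total_cpus=4
echo "make -j $(suggest_jobs "$total_cpus") CC=\"distcc\""
# prints: make -j 8 CC="distcc"
```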
The CC option can be handled in several ways, and it can be a source of complications. You can export this environment variable or define it in your ~/.bashrc file, if you like. Alternatively, you can modify your Makefile to specify
distcc on its
CC line. Passing
CC="distcc" on the
make command line gives you the flexibility to redefine it as required, however, and to use
distcc to compile software you didn’t write without modifying its Makefile.
The CC variable can be a source of complications because you may need to define it differently than I’ve just described. In particular,
distcc accepts a compiler name as an argument. In most cases, defining
CC="distcc" is equivalent to defining
CC="distcc gcc"; both definitions cause
distcc to use
gcc as the default compiler. You might need to tweak this definition in some cases, though. One of these is when you’re compiling a C++ program rather than a C program. In this case, you must define the
CXX environment variable instead of or in addition to the
CC variable. In particular, you should specify
CXX="distcc g++". This definition tells
distcc to use
g++ to compile C++ programs. If you fail to include this definition (on the command line, in an exported predefined variable, or in your Makefile), strange things can happen, including linking errors, compilation errors, and failure of
distcc to even attempt to contact its servers.
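For a project that mixes C and C++ sources, both variables can be passed on the same command line. The sketch below assembles such an invocation; distcc_make_args is a hypothetical helper, and the job count of 8 is just the value used in the earlier example.

```shell
#!/bin/sh
# Assemble make arguments so that both the C and the C++ compiler are
# invoked through distcc. Variables given on the make command line
# override assignments made inside the Makefile.
distcc_make_args() {
    jobs=$1
    echo "-j $jobs CC=\"distcc gcc\" CXX=\"distcc g++\""
}

echo "make $(distcc_make_args 8)"
# prints: make -j 8 CC="distcc gcc" CXX="distcc g++"
```

Printing the command first makes it easy to confirm the quoting before you run it in a source tree.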
Monitoring distcc’s Actions
When you compile a program using
distcc, you’ll see the normal output of your C compiler, including warnings and error messages. The main difference is that you’ll see the compiler name reported as
distcc rather than whatever it would ordinarily be. You may also spot a few
distcc-specific messages in the output, such as warnings that particular hosts couldn’t be located.
Another way to monitor
distcc is to use a monitor program. By default, when you compile the
distcc package from source code, a program called
distccmon-text is also built. If you pass the
--with-gtk or --with-gnome option to the
configure script, a program called
distccmon-gnome will also be built. These two programs enable you to monitor
distcc’s activities.
You can use
distccmon-text in one of two ways:
If you type
distccmon-text with no arguments while a package is being built, the program reports what files are currently being compiled on which hosts. This is a one-time instantaneous report, so it’s useful for verifying that the hosts you expect to be in service really are doing their jobs.
To monitor activity in an ongoing fashion, add a number representing the update period. For instance, you might type
distccmon-text 1 to have the program report compile jobs at 1-second intervals. This method enables you to track activity in an ongoing fashion.
In either case, the output of
distccmon-text looks something like this:
21393 Compile main.cc localhost
21397 Compile daqbase.cc localhost
21396 Compile array2d.cc seeker
21400 Compile clock.cc seeker
Most of this output information is self-explanatory: it includes a process ID (PID) number, a state, an input filename, and the host that’s processing the file. The state (
Compile for all four lines in this example) is likely to be
Compile, but it can be something else. If a host seems stuck at an odd value, such as
Blocked, that may indicate problems.
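If you want a quick summary of how the work is spread out, you can count the jobs per host. The sketch below assumes the four-column layout shown in the sample output above (PID, state, file, host); the output of your distcc version may differ in detail, and jobs_per_host is a hypothetical helper.

```shell
#!/bin/sh
# Count jobs per host from distccmon-text output read on standard input.
# Assumes the host name is the last of at least four fields on each line.
jobs_per_host() {
    awk 'NF >= 4 { count[$NF]++ } END { for (h in count) print h, count[h] }' | sort
}

# Sample input copied from the output shown above.
jobs_per_host <<'EOF'
21393 Compile main.cc localhost
21397 Compile daqbase.cc localhost
21396 Compile array2d.cc seeker
21400 Compile clock.cc seeker
EOF
```

In live use, you would pipe the monitor's output into the function while a build is running, for example with distccmon-text | jobs_per_host.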
A somewhat fancier monitor utility is
distccmon-gnome, which is a GUI monitor utility that’s similar in features to distccmon-text.
distccmon-gnome always operates in continuous-update mode; thus, there’s no need to pass it any arguments. If you launch it when nothing is compiling,
distccmon-gnome displays a blank window, but once you begin compiling a program with distcc,
distccmon-gnome displays progress information.
The information presented is similar to that shown by
distccmon-text; however, the Tasks column presents additional information: a color-coded bar indicating the amount of time spent in various types of activity. In my tests, this time appeared to be roughly evenly split between connect and compile stages for remote hosts; however, your values could be very different, depending on the speed of your network, the speeds of your computers, and the program you’re compiling.