Using distcc

Don't waste unused CPU cycles -- put them to work compiling software! Rod Smith shows you how to use distcc to harness the power of distributed computing to speed up your compile time.

Distributed computing is an increasingly important and popular idea. Most computers sit idle most of the time; even when you’re typing furiously in a desktop computer’s word processor, for example, that computer’s CPU will be mostly idle, assuming you’re not running other CPU-intensive tasks in the background. Enabling one computer to use another computer’s unused CPU cycles can be an effective way to speed up certain tasks. On a very large and very public scale, projects such as SETI@Home and Folding@Home do just this, using CPU time donated from countless desktop systems to perform CPU-intensive scientific computations.

Distributed computing need not be limited to big scientific projects, though; you can use this technique yourself to speed up some of your own computing tasks. One useful tool in this category is distcc, a set of programs that enables the distribution of C, C++, Objective C, or Objective C++ compilation tasks to multiple computers. When you deploy distcc on your own network, you can greatly reduce the time required to compile large programs. If you compile your own kernel or other programs frequently, or if you develop programs yourself in these languages, and if you’ve got access to even just a couple of networked computers, distcc may save you time.

Why Use distcc?

I’ve just outlined the main reason to use distcc: It can save you time. The principle is simple enough: If you’ve got two computers with equal CPU speed, distcc can split the job of compiling a program across those two computers, assigning each source code file to whichever computer finishes its last assigned file first. The theoretical best-case result with two computers is a halving of the time to compile a program, and that benefit becomes more pronounced as more computers are added to the compile farm. Of course, in practice, distcc imposes some overhead in the form of network transfers and its own CPU demands, so you won’t see quite the theoretical maximum speed increase, but the improvement can still be quite substantial.

Unfortunately, distcc does have its drawbacks. It’s an extra system to configure, particularly in a mixed environment: distcc uses each computer’s local C compiler, so if your network mixes CPU architectures or OSes, you may need to install cross-compilers on some of your systems (a task that’s outside the scope of this column). The CPU load on the computers could conceivably come at an awkward time, too; your boss might not appreciate his computer slowing down when he’s preparing a presentation!

Despite these problems, distcc can be a useful tool, particularly if you’ve got a few systems that are mostly idle and if you compile a lot of software yourself. Gentoo users are particularly likely to find distcc useful, since most Gentoo packages are compiled locally.

Configuring distcc Servers

Many Linux distributions today include distcc packages. Thus, you may be able to install distcc by using your package management system; for instance, by typing apt-get install distcc on a Debian or Ubuntu system, yum install distcc on RPM-based distributions that use yum, or emerge distcc on Gentoo. This should install both the tools you’ll use on the system that’s officially compiling the source code and the network server and associated tools needed on the distcc servers (that is, the computers that work at the behest of the main system).

If your distribution lacks a distcc package, you can search for a replacement at sites such as RPM Find (http://www.rpmfind.net) or download the source code from the main distcc site, http://distcc.samba.org. You’ll then have to compile and install the package yourself, a task you should be capable of doing without further advice if you need distcc! I do have one non-obvious piece of advice, though: When you run the configure script, pass it the --with-gnome or --with-gtk option; either option will enable compilation of a GUI monitor tool that can be useful in tracking down problems, as I describe later.

Before proceeding further, you may want to check the C compiler versions on your computers. In theory, distcc will work, and produce usable executables, when you use different compiler versions or even different compilers (GCC vs. ICC, for instance) on the various computers in your build farm. In practice, though, mixing different compiler versions, and especially different compiler packages, can cause problems.

This is particularly true when you compile C++ programs. If possible, you should upgrade when necessary to make sure all systems are running the same version of the compiler, or at least ensure that they’re as close as possible to each other. Although distcc is supposed to work with Intel’s ICC compiler, distcc is best tested with GCC, and distcc has some ICC-specific limitations.
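One way to check is to collect each system’s compiler version and see whether they all agree. The following is only a minimal sketch, not part of distcc itself: it assumes you can reach every build host with passwordless ssh and that gcc is the compiler in use, and the function names collect_versions and count_distinct_versions are made up for this example.

```shell
# Print one "host version" line per build host.
# Assumption: passwordless ssh works for every host named on the command line.
collect_versions() {
    for host in "$@"; do
        printf '%s %s\n' "$host" "$(ssh "$host" gcc -dumpversion 2>/dev/null)"
    done
}

# Read "host version" lines on stdin and print how many distinct
# versions appear; 1 means every host reported the same compiler version.
# A host that could not be reached reports an empty version, which
# counts as a mismatch.
count_distinct_versions() {
    awk '{ print $2 }' | sort -u | wc -l | tr -d ' '
}
```

You might then type collect_versions kernighan ritchie stroustrup | count_distinct_versions; any result other than 1 suggests that at least one system needs attention.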

Note that you do not need to have all the libraries and header files for a project installed on all the computers in a compile farm; only the primary system needs these files. This system performs basic pre-processing on all the source code files and delivers the pre-processed files to the server systems, obviating the need for header files on the distcc servers. Linking the object files into a final executable is performed entirely on the developer’s computer, so libraries need only be installed on that computer.

With distcc installed and your C compiler versions synchronized with each other, you should check all your systems to be sure each has a user named distcc defined. This is the user that the distcc server uses for compiling software, if possible. The distcc user can be an ordinary user account, but it doesn’t need login privileges or a home directory.
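A quick check along these lines can tell you whether the account exists on a given system. This is a sketch under assumptions: check_build_user is a hypothetical helper name, and the useradd options it suggests (--system and --no-create-home) are those of the useradd tool shipped with most Linux distributions, so consult your own system’s man page before relying on them.

```shell
# Report whether a user account exists; defaults to the distcc user.
# The suggested useradd command is only printed, never run.
check_build_user() {
    user="${1:-distcc}"
    if id -u "$user" >/dev/null 2>&1; then
        echo "$user: present"
    else
        echo "$user: missing; as root, try: useradd --system --no-create-home $user"
    fi
}
```

Run check_build_user with no arguments on each system in the compile farm to test for the distcc account.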

Your distcc package should include a program called distccd. This is the distcc daemon, and it must be run on the distcc server computers. Systems that will be initiating distcc sessions may run this daemon, but don’t need to if other computers don’t need to access them. In the short run, the simplest way to start the daemon is to launch the program manually:

distccd --daemon

When run in this way, distccd launches itself as a daemon and accepts connections from any system. If you want to restrict the systems to which distccd responds, you may add the --allow option. This is required for recent versions of distcc:

distccd --daemon --allow 192.168.1.0/24

This example restricts access to the 192.168.1.0/24 block of IP addresses. To allow access to a single computer, specify its IP address rather than a network block. You can include multiple --allow parameters to allow access from multiple networks or clients.
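If you have several networks or clients to admit, assembling the command by hand gets tedious. The sketch below builds the command line with one --allow option per argument and prints it rather than running it; build_distccd_cmd is a made-up helper name, and the single address 192.168.2.15 is purely illustrative.

```shell
# Print a distccd command line with one --allow option per argument.
# The command is only printed so that you can review it before use.
build_distccd_cmd() {
    cmd="distccd --daemon"
    for net in "$@"; do
        cmd="$cmd --allow $net"
    done
    printf '%s\n' "$cmd"
}

build_distccd_cmd 192.168.1.0/24 192.168.2.15
# prints: distccd --daemon --allow 192.168.1.0/24 --allow 192.168.2.15
```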

In the long term, you may want to configure distccd to run automatically. Most distribution-specific distcc packages include an appropriate SysV startup script, such as /etc/init.d/distccd. I recommend you examine this script; its details differ from one package to another, including methods used to set defaults such as hosts that are allowed to connect to it. You might need to edit files in /etc/distcc, edit the /etc/default/distcc file, or edit other configuration files. You may need to edit these configuration files or make changes with distribution-specific SysV startup script editing tools, such as sysv-rc-conf or system-config-services, in order to have the distcc SysV startup script run automatically the next time you boot the computer.

Configuring distcc on the Master System

Preparing a computer to initiate distcc-based compiles is a matter of setting environment variables or editing configuration files and then telling the computer to use distcc rather than gcc or some other compiler to do the work. Behind the scenes, distcc will call gcc or other compilers, but it will distribute at least some files to other computers when it does so.

The most important environment variable is DISTCC_HOSTS, which specifies the computers you want to function as distcc servers:

export DISTCC_HOSTS="localhost kernighan ritchie stroustrup"

This line tells distcc to use four computers to compile software: localhost, kernighan, ritchie, and stroustrup. You should, of course, adjust the list of systems for your network; however, you’ll typically include localhost in the list, and it should normally appear first, because distcc uses the systems in the specified order, so placing localhost and your fastest computers first in the list can provide a modest speed boost. You can place this line (including the export keyword, so that make and distcc inherit the variable) in your ~/.bashrc file to automate the process on subsequent logins.

As an alternative to defining the DISTCC_HOSTS environment variable, you can specify a list of hosts in the ~/.distcc/hosts or /etc/distcc/hosts file. This file should contain a list of hosts, all on one line, separated by spaces.
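Creating the file takes only a couple of commands. As a rough sketch, the hypothetical helper below (write_hosts_file is not a distcc tool) creates the parent directory if needed and writes all the host names on a single space-separated line, as the file format requires.

```shell
# Write a distcc hosts file: first argument is the file to create,
# remaining arguments are the host names, in order of preference.
write_hosts_file() {
    file="$1"
    shift
    mkdir -p "$(dirname "$file")"
    # "$*" joins the remaining arguments with spaces on one line.
    printf '%s\n' "$*" > "$file"
}
```

For the example network used here, you might type write_hosts_file ~/.distcc/hosts localhost kernighan ritchie stroustrup.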

Compiling a Program with distcc

Ideally, compiling a program using distcc requires only a simple change to the make command you normally issue. With most programs, you configure them (you’ll still need to do this locally) and then type make. With distcc in place, instead of typing make alone, you need to include the -j num option to have make fork off the specified number of parallel compile operations. You must also set the CC environment variable to distcc to have make call distcc:

make -j 8 CC="distcc"

The number you pass using the -j option should normally be about twice the number of CPUs you have available; however, you might want to experiment with different values to discover what works best with your configuration.
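The rule of thumb is easy to turn into arithmetic. The following sketch, with the made-up helper name suggest_jobs, adds up the CPU counts you supply for each system in the compile farm and doubles the total; on many Linux systems the nproc command reports the local CPU count, but you’ll need to supply the figures for the remote hosts yourself.

```shell
# Suggest a value for make's -j option: twice the total number of CPUs.
# Each argument is the CPU count of one system in the compile farm.
suggest_jobs() {
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    echo $((total * 2))
}

suggest_jobs 2 2   # two dual-CPU systems; prints 8
```

Treat the result as a starting point for experimentation rather than a firm answer.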

The CC option can be handled in several ways, and it can be a source of complications. You can export this environment variable or define it in your ~/.bashrc file, if you like. Alternatively, you can modify your Makefile to specify distcc on its CC line. Passing CC="distcc" on the make command line gives you the flexibility to redefine it as required, however, and to use distcc to compile software you didn’t write without modifying its Makefile.

The CC variable can be a source of complications because you may need to define it differently than I’ve just described. In particular, distcc accepts a compiler name as an argument. In most cases, defining CC="distcc" is equivalent to defining CC="distcc gcc"; both definitions cause distcc to use gcc as the default compiler. You might need to tweak this definition in some cases, though. One of these is when you’re compiling a C++ program rather than a C program. In this case, you must define the CXX environment variable instead of or in addition to the CC variable. In particular, you should specify CXX="distcc g++". This definition tells distcc to use g++ to compile C++ programs. If you fail to include this definition (on the command line, in an exported predefined variable, or in your Makefile), strange things can happen, including linking errors, compilation errors, and failure of distcc to even attempt to contact its servers.
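To avoid forgetting one of the two variables, you could wrap the whole invocation in a small shell function. This is only a sketch: distmake is a hypothetical name, and the MAKE override is an assumption added here so that you can substitute another command (such as echo) to preview what would be run.

```shell
# Run make with both the C and C++ compilers routed through distcc.
# First argument: number of parallel jobs; the rest is passed to make.
# Set MAKE=echo beforehand to print the arguments instead of building.
distmake() {
    jobs="$1"
    shift
    ${MAKE:-make} -j "$jobs" CC="distcc gcc" CXX="distcc g++" "$@"
}
```

Typing distmake 8 is then equivalent to typing make -j 8 CC="distcc gcc" CXX="distcc g++".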

Monitoring distcc’s Actions

When you compile a program using distcc, you’ll see the normal output of your C compiler, including warnings and error messages. The main difference is that you’ll see the compiler name reported as distcc rather than whatever it would ordinarily be. You may also spot a few distcc-specific messages in the output, such as warnings that particular hosts couldn’t be located.

Another way to monitor distcc is to use a monitor program. By default, when you compile the distcc package from source code, a program called distccmon-text is also built. If you pass the --with-gtk or --with-gnome options to distcc’s configure script, a program called distccmon-gnome will also be built. These two programs enable you to monitor distcc’s activities.

You can use distccmon-text in one of two ways:

  • If you type distccmon-text with no arguments while a package is being built, the program reports what files are currently being compiled on which hosts. This is a one-time instantaneous report, so it’s useful for verifying that the hosts you expect to be in service really are doing their jobs.

  • To monitor activity in an ongoing fashion, add a number representing the update period. For instance, you might type distccmon-text 1 to have the program report compile jobs at 1-second intervals. This method enables you to track activity in an ongoing fashion.

In either case, the output of distccmon-text looks something like this:

$ distccmon-text
21393 Compile  main.cc     localhost[0]
21397 Compile  daqbase.cc  localhost[1]
21396 Compile  array2d.cc  seeker[0]
21400 Compile  clock.cc    seeker[1]

Most of this output information is self-explanatory: it includes a process ID (PID) number, a state, an input filename, and the host that’s processing the file. The state (Compile for all four lines in this example) is likely to be Compile, but it can be something else. If a host seems stuck at an odd value, such as Blocked, that may indicate problems.
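If the list grows long, a short filter can summarize how many jobs each host is handling. The sketch below assumes the four-column layout shown in the sample output above (lines in other states may be laid out differently), and count_jobs_per_host is a made-up name rather than a distcc tool.

```shell
# Read distccmon-text output on stdin and print "host count" pairs.
# Assumes the host appears in the fourth column as name[slot].
count_jobs_per_host() {
    awk '{
        sub(/\[[0-9]+\]$/, "", $4)   # strip the trailing [slot] number
        jobs[$4]++
    }
    END {
        for (h in jobs) print h, jobs[h]
    }' | sort
}
```

Typing distccmon-text | count_jobs_per_host during the build shown above would report two jobs each for localhost and seeker.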

A somewhat fancier monitor utility is distccmon-gnome, which is a GUI monitor utility that’s similar in features to distccmon-text. Unlike distccmon-text, though, distccmon-gnome always operates in continuous-update mode; thus, there’s no need to pass it any arguments. If you launch it when nothing is compiling, distccmon-gnome displays a blank window, but once you begin compiling a program with distcc, distccmon-gnome displays progress information.

The information presented is similar to that shown by distccmon-text; however, the Tasks column presents additional information: a color-coded bar indicating the amount of time spent in various types of activity. In my tests, this time appeared to be roughly evenly split between connect and compile stages for remote hosts; however, your values could be very different, depending on the speed of your network, the speeds of your computers, and the program you’re compiling.
