Software Development, the Apache Way, Part Four

With dozens of software projects involving hundreds of developers, keeping data flowing smoothly is an involved process for the Apache Software Foundation (ASF). With tens of machines distributed worldwide, gigabytes of daily downloads, and fifty hits per second on the Apache home page, system maintenance requires the varied skills of a small legion of volunteers. In the fourth in an ongoing, exclusive series, ASF co-founder Ken Coar pulls back the curtain to reveal how it all works.

The Apache Software Foundation, or ASF, is a non-profit organization chartered with creating freely available software for the public good. But the ASF represents much more than that. The group is also an embodiment of a philosophy of collaboration, often called the “Apache Way.”

With dozens of software projects involving hundreds of developers, keeping data flowing smoothly is an involved process. The ASF looks both inward and outward, but focuses on different things according to the direction. When looking outward, the emphasis is on how the Foundation interacts with the rest of the world, safeguarding the Apache brand, participating in various standards committees, and so on. Looking inward, the focus is on helping all the participants do what they do — write code.

As the ASF has grown and its needs have become apparent, individuals have stepped up to meet the challenges and eventually coalesced into teams or groups devoted to a purpose. For instance, the Public Relations Committee is concerned with things like interactions with the press, fielding requests to use the Apache marks, and so on. There’s also the Infrastructure group, a team of overworked volunteers who manage the systems, the network, the software repositories, and all of the Web sites and other bits that allow the Foundation to get its work done.

There are actually two kinds of web sites associated with the Foundation. There’s the kind tied to a specific project, such as http://httpd.apache.org/ (for the Apache HTTPD Server) or http://tomcat.apache.org/ (for Tomcat), and there’s the kind which has no such specificity, like the main Web site at http://apache.org/ or the package download mirror system. The infrastructure team is responsible for seeing that all of the sites are accessible, but isn’t responsible for the content of the project-specific sites.

The infrastructure team is responsible for all of the sites, to various degrees. For each project-specific site, the infrastructure staff makes sure that the site is functional and can be updated by the members of the project involved. For the non-project sites, the infrastructure group has primary responsibility for both content and operation.

Tending the Code

Probably one of the most important responsibilities of the infrastructure team is keeping the source management systems in operation. Since the primary “product” of the ASF is software, source code, and documentation, it stands to reason that the continuity and integrity of the sources are of utmost concern. For years the ASF used the Concurrent Versions System (CVS), but that package has some known deficiencies, and the Foundation recently switched to Subversion (SVN, http://subversion.tigris.org/). (A number of people from the ASF became involved with SVN while it was being developed, and provided valuable input to make Subversion better than CVS. This has proven quite beneficial: as the ASF moved from CVS to SVN, it had experts in the new software available “in house.”)

In the beginning, the infrastructure needs of the Apache development organization were quite modest: a web site, a mail server, and a software repository. Originally, all of these services lived on a single computer, sharing the system with a number of unrelated domains and handled by one of the Apache developers. As time passed and the various service loads increased, the Apache domain moved to its own dedicated hardware, and then to multiple machines, and most recently to systems that are geographically distributed. At the time I’m writing this, there are more than a dozen sizeable systems devoted solely to the Foundation.

The need for this much horsepower becomes understandable when you realize that the Apache infrastructure supports sixteen server systems; more than three hundred mailing lists; over one million email messages per day; over two dozen top-level projects; dozens of ASF project web sites; more than a thousand developers working on dozens of packages; fifty hits on the web site per second (more than four million per day); over one hundred gigabytes downloaded from the Web site per day; repositories containing over fifteen gigabytes of code and documentation; and hundreds of thousands of revision records. (This doesn’t even touch on the world-wide mirroring system. More on that in a bit.)

You can see the size and activity of some of these operations online. http://people.apache.org/~coar/mlists displays information about the Foundation’s public mailing lists, the number of people subscribed, and graphs of activity. http://people.apache.org/~henkp/ shows various collections of information about the relationships and activities within the Foundation. http://monitoring.apache.org/status/ is the Nagios page used by the infrastructure team to track system status. This system also automatically notifies the team in real time of incidents such as “system down.” http://svn.apache.org/viewcvs/ provides a view into the living repositories on which those hundreds of developers are busily working.

Graphs and charts are great for showing trends at a glance. For example, Figure One shows the count of committers over time, and Figure Two pictures ASF membership over time.

FIGURE ONE: The number of Apache committers over time

ASF member Stefano Mazzocchi wrote a Java applet to scan the mail archives and graph the relationships between different correspondents. He called the applet “Agora,” from the Greek meaning social meeting-place. People who are more prolific, email-wise, tend to be toward the centre of the graph. The applet allows you to grab an individual and drag him or her out to one side, and the rest of the graph adjusts to the new position, most probably by pulling that person back into the relationship cloud again. Figure Three shows a static capture of the Agora display for the Apache Portable Runtime (APR) project’s commit reports and development discussion lists from November 2000 through October 2005. Each list has its own black circle; from the display you can see that the people actually committing changes to the code are a small subset of the entire population of the discussion list. This is one way to get a feel for contributors versus committers.

FIGURE TWO: The number of Apache Software Foundation members over time

Oiling the Engine

There are about two dozen people who form the infrastructure team and voluntarily support all of the machinery. Different individuals have different skills. Some mainly watch over the web servers, some focus on the repositories, some manage the email, and so on. In some cases there’s an adequate geographic distribution of capable people, but sometimes problems end up having to wait for resolution until the Earth rotates to let the Sun shine where the experts live.

FIGURE THREE: Agora displays the relationship between people on mailing lists

There are rare occasions when someone has to go deal with one or more of the physical systems. Sometimes the issue can be handled by the staff at the facility where the system is located, and sometimes one of the Foundation’s own people needs to take care of it. Hardware tools are used to make this as rare as possible.

For example, all of the systems are accessible through KVM devices that can make it seem as though you are sitting right in front of the system. That doesn’t help when a cable falls loose, however. Sometimes, there’s just no substitute for the personal touch.

Some of the subsystems interact with each other. For instance, when changes are made to the code repositories, details are sent through the Apache mail system to one or more of the mailing lists. When significant system events occur, notifications are sent to IRC channels. In some cases, if the “why this change was made” message that accompanies each commit happens to mention a bug tracker number, some glue code causes the tracker item itself to be updated. That’s a nice bit of automation, since it frees a developer from having to remember to go update the tracker item and copy and paste the explanatory text.
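The commit-to-tracker glue described above can be sketched in a few lines of Python. This is only an illustration, not the Foundation's actual hook: the reference styles it recognizes (a JIRA-like key such as EXAMPLE-42, or "bug 12345"), the function names, and the comment format are all assumptions made for the example.

```python
import re

# Assumed reference styles; real projects may use other conventions.
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")                        # e.g. EXAMPLE-42
BUG_NUMBER = re.compile(r"\b(?:bug|PR)\s*#?\s*(\d+)\b", re.IGNORECASE)  # e.g. bug 12345


def find_tracker_refs(log_message):
    """Return the tracker references mentioned in a commit log message."""
    refs = [m.group(0) for m in JIRA_KEY.finditer(log_message)]
    refs += ["bug " + m.group(1) for m in BUG_NUMBER.finditer(log_message)]
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(refs))


def build_tracker_comment(revision, author, log_message):
    """Format the text a hook could append to each referenced tracker item."""
    return "Commit r{} by {}:\n{}".format(revision, author, log_message.strip())


if __name__ == "__main__":
    message = "Fix null check in request parser. See EXAMPLE-42 and bug 12345.\n"
    for ref in find_tracker_refs(message):
        print(ref)
        print(build_tracker_comment(1234, "jdoe", message))
```

In a real deployment, a script like this would presumably run from the repository's post-commit hook and hand the formatted comment to the tracker's own interface; that last step is deliberately left out here because it depends entirely on the tracker in use.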

It would be difficult to arrange the various functions by importance. The Foundation creates software, so the repositories are very important. But it’s also about collaborative development, which means that the email system used for communication is very important as well. And access to the Foundation’s resources is done through the web, so the web server function is very important.

The ASF has been the beneficiary of a number of donations that have eased the unexpected growing pains. At this point, most of the equipment and connectivity requirements have been met, and the only overextended resource is the volunteers.

Change Is the Only Constant

As you can see, the scope of the infrastructure needed to support the ASF’s operations has expanded enormously in the last few years, and the growth has revealed unforeseen issues with scale. What made sense in the beginning — and was an ad hoc solution evolved by necessity — no longer works as well, if at all. Scripts and cron jobs pick up a lot of the slack and administrivia that is subject to automation, but the aspects that require human judgement or interaction remain. In fact, new such factors appear from time to time.

For instance, when the Apache project first started, there were two “classes” of participants: those with commit access to the sole repository, and those without. As new projects started up, each with its own repository area, the need to grind more finely arose: specifying who had access to which repositories. When the Foundation was incorporated and the concept of “members” was introduced, so was a new dimension, because ASF members have access to all aspects of the Foundation’s operation. This definitely does not mean they all have commit access to all repositories. But they do have the privilege of subscribing to and participating in any ASF mailing list, and have read access to all repositories and mail archives, including any private ones.
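The two dimensions described above (per-repository commit rights, plus membership granting read access everywhere) can be modelled in a short Python sketch. The class and function names are invented for illustration, and two rules are assumptions rather than statements from the Foundation: that public repositories are readable by anyone, and that a committer can read a private repository he or she commits to.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    name: str
    is_member: bool = False                           # ASF member?
    commit_repos: set = field(default_factory=set)    # repositories with commit access


def can_commit(person, repo):
    """Commit access is granted per repository; membership alone is not enough."""
    return repo in person.commit_repos


def can_read(person, repo, private=False):
    """Members may read everything, including private repositories."""
    if not private:
        return True  # assumption: public repositories are world-readable
    # assumption: committers can also read a private repository they commit to
    return person.is_member or repo in person.commit_repos


if __name__ == "__main__":
    member = Person("member-only", is_member=True)
    committer = Person("httpd-committer", commit_repos={"httpd"})
    print(can_commit(member, "httpd"), can_read(member, "foundation", private=True))
    print(can_commit(committer, "httpd"), can_read(committer, "foundation", private=True))
```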

This was interesting enough when there was only one repository engine, CVS, in operation. When Subversion was added, relating the two different access control mechanisms added still more spice to the infrastructure team’s life. However, once the task was done it didn’t require any further attention, and the goal is to further simplify matters by migrating all the CVS repositories into Subversion over time. (Subversion has a number of particularly attractive advantages over CVS, including having repositories that can be browsed directly, a network protocol that doesn’t require its own unique port number, and built-in support for secure transactions over SSL.) Adopting a single source code management system will lighten the infrastructure team’s load slightly in one way, but can have hidden costs as well. Consider the effort to maintain an application in which no one has much interest, and having to track the little-used software to keep up with security alerts, making sure the neglected application doesn’t inadvertently cause problems or exposures when the rest of the system is upgraded, and so on.

Mirror, Mirror, on the Web

In addition to the main Apache site, the download area is mirrored world-wide using rsync. Over two hundred sites download copies of the Apache packages and make them available for people who find the mirror faster or more convenient than the main Apache server. (The mirroring system also means that the software packages are still available even when the Foundation’s infrastructure is down temporarily.) The mirrors collectively download more than 20 GB of data from the main download areas to their own copies.

One of the more recent and controversial additions to the Apache Software Foundation’s infrastructure was a wiki package. If you’re not familiar with wikis, you can think of them as rooms full of whiteboards that pretty much everyone can write on — and erase. In essence, wikis make the Web read-write rather than just read-only.

Some projects have embraced wiki technology as part of the way they communicate internally, and some haven’t, typically because the mechanism clashes somehow with the project’s established communication methods. One way in which wikis are significantly different is that they’re generally a “pull” technology, meaning that you need to go and check to see if anything has changed. In addition, the wiki pages are complete documents, so you may be hard put to read one and mentally compare it with what you remember it used to say. Of course, this particular drawback can be somewhat alleviated by treating the wiki like any other Apache information repository, and sending change summary messages whenever the wiki is changed.
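A change summary of the kind mentioned above can be produced by comparing the old and new text of a page. The following Python sketch uses the standard library's difflib module; the function name, the page name, and the labels in the diff header are assumptions for the example, and how the previous revision is obtained and how the message is mailed depend on the wiki package in use.

```python
import difflib


def change_summary(page_name, old_text, new_text):
    """Return a unified diff of two revisions of a wiki page, or "" if unchanged."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=page_name + " (previous)",
        tofile=page_name + " (current)",
        lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    old = "Release plan\nShip on Friday\n"
    new = "Release plan\nShip on Monday\n"
    # The result could be used as the body of a notification message.
    print(change_summary("ReleasePlan", old, new))
```

Sending only the changed lines turns the wiki back into a "push" medium for list subscribers, and spares readers from having to compare the full page against memory.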

Although the general preference of the infrastructure team is for the applications and tools they provide and maintain to be open source, it’s not a requirement. For example, although some emphasis is put on the use of Bugzilla and OpenSolaris zones, one of the more prominent tracking systems in use is JIRA (http://www.atlassian.com/software/jira/), and a lot of the software environments provided are, in fact, virtual systems running under VMware. Both JIRA and VMware are commercial products.

Content Management

The underlying function of the web server is among the duties of the infrastructure team, but the maintenance of the content is up to the individual projects. Each of the dozens of Apache projects has its own web site on the Foundation’s server, such as http://geronimo.apache.org/ or http://incubator.apache.org/. And each one has its own way of managing its site’s content.

Some are maintained by hand, treating the site’s HTML files like any other code module in a repository, and the site itself as a checked-out, working copy. Some use the Apache Forrest package, some use the semi-internal Anakia package, and some use entirely different methods. On the one hand, this makes sense, as it allows the projects to use the tools and applications they feel most comfortable with; on the other hand, it can mean that the infrastructure team may need to maintain dozens of tools.

For those projects that use a tool to generate the Web pages, yet another choice arises: should only the source for the site be kept in the repositories, or should both the source and the results generated by the tool be maintained under version control? One side of the discussion points out that the files are just output, like an executable from a build, and you don’t keep those in a repository. The other side points to the fact that having the generated files in the repository lets people refer to them from a working copy when offline, without having to have a duplicate installation of all the tools necessary to generate them from the source. Different projects have adopted different answers to the dilemma; there is none that is officially considered “best practice.”

A Squeaky Wheel

Considering that the ASF’s packages are cutting edge in network software technology, it’s interesting that the internal maintenance functions are in need of serious technological lubrication.

Most of the contributor energy goes into the projects and very little is devoted to the invisible operational infrastructure of the Foundation itself. Some maintenance tasks do get automation attention (backups, site builds, and data synchronisation, for example), but the repetitive tasks largely handled directly by humans are still woefully effort-intensive.

For instance, it’s only fairly recently that the process of adding access for new committers became mostly automated. Given the Foundation’s antecedents, it would seem logical and proper for a lot of the work (at least that involved in making and fulfilling requests) to be handled using web-based forms. Alas, one of the obstacles there is the need for secure authentication (making sure that the forms are filled out by people authorized to do so). Another is the complexity of the access control that would be needed. Person X can make requests of type Y on behalf of project P, but not for project Q. And so on. Work to address this through personal SSL certificates is underway.
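The "person X, request type Y, project P" rule can be pictured as a table of explicit grants. The Python sketch below is purely illustrative: the names, request types, and project labels are hypothetical, and it assumes the requester's identity has already been established (by a personal SSL certificate, say), which is the harder half of the problem.

```python
from typing import NamedTuple


class Grant(NamedTuple):
    person: str         # authenticated identity of the requester
    request_type: str   # e.g. "add-committer" (hypothetical label)
    project: str        # project the request is made on behalf of


def is_authorized(grants, person, request_type, project):
    """A request is allowed only if an exactly matching grant exists."""
    return Grant(person, request_type, project) in grants


if __name__ == "__main__":
    grants = {
        Grant("person-x", "add-committer", "project-p"),
        Grant("person-x", "create-mailing-list", "project-p"),
    }
    print(is_authorized(grants, "person-x", "add-committer", "project-p"))
    print(is_authorized(grants, "person-x", "add-committer", "project-q"))
```

Denying anything without an explicit grant keeps the default safe; the cost is that every new combination of person, request type, and project has to be recorded by someone, which is itself more administrative work.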

Since the sorts of contributors the ASF attracts tend to be motivated by personal needs, it’s really not surprising that the infrastructural tasks of the Foundation don’t attract hordes of developers. Great strides have been made in streamlining the management of accounts, code access, and email, but there’s still a long way to go. Although some improvements are planned, most get done in a “just-in-time” manner. As with all support roles such as system administration, when things are functioning smoothly, no one notices. It’s not until there’s a problem that the curtain goes up and the behind-the-scenes activity is revealed.

In addition, most of the members of the infrastructure team aren’t there because it’s their first love. Rather, they might have gotten involved because they needed something done and the team was too overworked to get to it, or they might be answering friends’ pleas to come help out, or there might be any number of other reasons for their initial involvement. One of the issues, however, is that the infrastructure needs are so great that these developers who were taking time out from their usual projects sometimes find themselves burning out in more ways than one precisely because the infrastructure work isn’t their first love. Think of the fable of the boy with his finger plugging the leak in the dike. There’s no question that plugging that hole was something that needed doing, but it’s reasonable to assume that the boy would really have preferred to be somewhere else doing something else.

The very complexity of the environment is an obstacle to getting more help. Lots of the normal operational tasks require superuser access, and that’s not something given out lightly at the best of times. In an environment like Apache, which is elaborate, distributed, and has high world-wide visibility, anyone who is going to have Absolute Power on the system really needs to have more than an inkling of how it all works. Even if someone is only going to be receiving superuser access for one of the Apache systems, it’s important to understand how it fits in with the rest of the infrastructure and particularly the dependencies it has on others and they on it.

Hence, someone who’s going to help out on the infrastructure team in a superuser capacity needs to be a competent Unixish system administrator already, be a member of the Foundation, and have served some sort of apprenticeship with the existing team. Since the infrastructure team runs on merit and peer respect like the rest of Apache, the apprenticeship may be of zero duration if everyone already knows and trusts the new superuser.

By no means is a new member of the team expected to know all the ropes. It’s enough if everyone knows that she’ll ask questions when in doubt. As time passes, more and more of the folklore describing the infrastructure gets documented. Unfortunately, it’s a race between the haphazard efforts of the documentors and the growth of the Apache infrastructure itself. (Remember that most of the team are developers, and that developers are not generally known for taking joy in writing documentation.)

So there’s a peek behind the curtain at the machinery that lets The Apache Software Foundation function. You can pay no attention to the man behind the curtain if you like, but the curtain is transparent after all.

Ken Coar is a co-founder of the Apache Software Foundation.
