The Future of the Linux Kernel:

After releasing Linux 2.4, Linus and Company spent most of 2002 stuffing the development kernel with gobs of new features. Here's a look behind the scenes of everyone's favorite Open Source project, and what to expect in 2003, the year of Linux 2.6.

While the Linux kernel is only a tiny piece of what most of us think of as a Linux system, it’s the features and characteristics of the kernel that set the limits of what an entire Linux system is capable of. As such, the kernel, the foundation of Linux, and the process that drives its development, is of great interest to many.

A year ago, at the beginning of 2002, kernel development was in an odd state. There had been no development kernel for almost a full year while the 2.4 kernel was slowly stabilized. (The sidebar “What’s in a Version Number?” explains what the kernel version numbers mean.) With no development kernel widely available, “patch pressure” (pressure to integrate new features into the kernel) was building with no relief in sight. Frustration levels were running high, and progress was seemingly at an all-time low.

A year later, at the beginning of 2003, the outlook is much, much brighter. New code is going into the kernel, the development process as a whole is working smoothly, and the anticipated release of Linux 2.6 in just a few months seems very achievable.

However, the past year has not been without its ups and downs.




What’s in a Version Number?


A kernel release number, such as 2.4.12, tells you quite a bit about the release. A kernel release number has three parts:


  1. The major release number (2)

  2. The minor release number (4)

  3. The patch level (12)

Major releases connote fundamental changes in the kernel architecture. So far, there have been only two major releases of the Linux kernel.

Minor releases connote less radical architectural change, though even minor releases of a Linux kernel version can bring significant improvements in performance and features. So far, version 2 of the Linux kernel has had four minor releases.

The patch level of a kernel denotes an official release that groups a set of patches (fixes and modifications) made to the kernel.

Generally, work on the Linux kernel proceeds along two parallel paths. A development kernel emphasizes cutting-edge features and performance. Consequently, development kernels resemble “Beta” software: those kernels sometimes contain troublesome bugs. Development kernels are assigned an odd minor release number (the next development kernel will be Linux 2.7). Eventually, development kernels form the basis for the next stable kernel release. A stable kernel emphasizes ongoing refinement and eradication of defects. Stable kernels, which are assigned even minor release numbers, generally do not contain serious bugs.

The code from the Linux 2.5 development kernel is becoming the stable Linux 2.6 kernel described in this feature.
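The numbering convention described in this sidebar is simple enough to express in a few lines. What follows is only an illustrative sketch in Python; the names KernelVersion and parse_version are made up for this example and are not part of any kernel tool.

```python
from typing import NamedTuple


class KernelVersion(NamedTuple):
    """Hypothetical container for the three parts of a release number."""
    major: int
    minor: int
    patch: int

    @property
    def series(self) -> str:
        # Odd minor numbers mark development series, even ones stable series.
        return "development" if self.minor % 2 else "stable"


def parse_version(text: str) -> KernelVersion:
    """Split a release string such as '2.4.12' into its three parts."""
    major, minor, patch = (int(part) for part in text.split("."))
    return KernelVersion(major, minor, patch)


if __name__ == "__main__":
    for release in ("2.4.12", "2.5.47"):
        print(release, "->", parse_version(release).series)
```

The sketch assumes a plain three-part release string; suffixes such as -ac or -dj would need to be stripped first.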

The Development Process

The Linux kernel is a large body of code developed by hundreds of programmers, which is ultimately linked into a single, critical executable. A few other open source projects rival the kernel in terms of sheer size, but almost none require so many developers to integrate their code so tightly into a single context.

The kernel’s development process is critical to its success or failure. If all those kernel hackers were unable to work together, the project would eventually collapse. Indeed, predictions of the Linux kernel’s demise have been heard for years, and some were voiced in 2002 as well. So it’s interesting to see that twelve months later, Linux kernel development appears to be more robust, vital, and sustainable than ever.

The beginning of that year was not auspicious, however. Early 2.5 releases were dominated by the block layer rewrite, and the complete “rototilling” of that fundamental layer of the kernel had a couple of consequences: the development kernel releases were sufficiently unstable that even the kernel hackers were afraid of them, and a great many important patches, mostly fixes for problems, were falling through the cracks. It appeared, once again, that Linus had hit his limits: he simply could not manage hundreds of little (but important) updates, and there was real risk that many of those fixes would be lost and never applied.

That was when Dave Jones sighed, “Something tells me I’ll regret this later,” and stepped up and volunteered to collect small patches, maintain them, and feed them to Linus, where they’d be accepted or explicitly rejected. The resulting -dj kernels became the staging area for fixes on their way into 2.5, and the risk of dropped patches was greatly reduced. The -dj kernels have faded from view over the course of the previous year, but they were a crucial part of the early 2.5 process.

In February 2002, Linus took an important step: he started using the BitKeeper code management system. Using BitKeeper had been under discussion for years, but BitKeeper never quite reached the point where Linus was willing to make the jump. BitKeeper has also been a controversial option, since BitKeeper is not free software. To this day, many kernel hackers refuse to use it for that very reason, and BitKeeper is always fertile ground for yet another linux-kernel flame war. Linus, uninterested in licensing battles, just wanted tools that worked.

As it turns out, BitKeeper has worked very well for him and many others. With BitKeeper in place, official kernel releases might not happen as frequently as they used to, and “pre-patches” have been eliminated altogether, but a real-time snapshot of Linus’ work, with change logs, is available to anybody who asks for it. BitKeeper appears to be very helpful to those who use it, yet doesn’t impinge upon those who don’t. It’s unlikely to go away anytime soon.

June 2002 saw the second Kernel Developers Summit, held in Ottawa, Ontario, Canada. Most of the major kernel developers were there, and they used the time to resolve many development issues. There was one decision made there, however, which clearly stands out: the establishment of October 31, 2002 as the date of the 2.5 feature freeze.

Oddly, no development kernel prior to 2.5 has ever had an established feature freeze date. Instead, Linus would wake up one morning and post a note announcing that a feature freeze had just been imposed. The inevitable result was a flood of “must include” patches, many of which would somehow get applied. Even if the development kernel was reasonably stable before the “freeze,” it tended not to be afterward. Moreover, the feature freezes themselves were notoriously “slushy.”

At the Kernel Summit, the developers posited that an established freeze date would let everyone plan their work accordingly. With luck and a schedule, it was hoped, there’d be no post-freeze patch flood, making the freeze more effective.

As of early November, shortly after the freeze, things seemed to have worked as intended. A final surge of last-minute features was sorted through in an orderly fashion, and there appears to be a strong desire to make the freeze hold.

Big-Iron Performance

One of the most focused and significant areas of 2.5 development will result in very few new features per se. For years, Linux kernel development has had, as one goal, improved performance on ever-larger systems. Each new, stable kernel has done better in this regard, but none has come close to the scalability of the 2.5 development kernel.

The block-layer overhaul (see the sidebar “What a Long, Strange Trip It’s Been”) was the beginning of this optimization effort. The block device layer in 2.4 and previous kernels suffered from a number of performance problems, including a lack of asynchronous I/O support and a tendency to split large operations into many small, single-block requests. The 2.5 kernel features an all-new request mechanism that keeps large operations intact, resulting in much-improved performance.




What A Long, Strange Trip It’s Been


While the kernel development team has made amazing progress in 2002, the project ran into its share of bumps along the way.


At the beginning of 2.5, the IDE/ATA disk driver subsystem was in need of work. While the code worked well for most users, it was generally held to be ugly and hard to maintain, and many outstanding patches were waiting to be merged. Martin Dalecki staged a form of hostile takeover, and set himself up as the new IDE maintainer. In that role, he produced over 100 patches which made drastic changes, all of which were accepted by Linus.


Unfortunately, not everybody was happy with Martin’s work. His patches ripped out features that people needed, played loose with the ATA standards, and often suffered from a lack of testing. Eventually a “foreport” of the 2.4 IDE/ATA layer was posted so that other developers could have a stable disk subsystem to work with.


When Martin finally got tired of the criticisms and quit working on the code, the entire body of his work was simply dropped in favor of the 2.4 code. The public rejection of six months’ work was hard. The kernel development process may produce good code, but it can be difficult for the developers.


A similar story was repeated in the new kernel configuration and build subsystems written by Eric Raymond and Keith Owens, respectively. At the first Kernel Summit, in 2001, Linus all but stated that he would merge those developments at the beginning of the 2.5 series. That merge never happened, and both Eric and Keith eventually abandoned their projects in disgust.

The new “deadline I/O scheduler” reorders requests aggressively for improved performance, while simultaneously ensuring that no requests wait for too long.
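The idea behind a deadline scheduler can be modeled in a few lines. The Python sketch below is a toy, not the kernel's code: the class name, the max_wait parameter, and the tick-based clock are all assumptions made for illustration, and the real scheduler treats reads and writes separately. It keeps every request in two structures at once, one sorted by sector for efficient sweeps across the disk, and one in arrival order so that an overdue request can jump the queue.

```python
import bisect
from collections import deque


class ToyDeadlineScheduler:
    """Toy model: sector-sorted dispatch with a cap on waiting time."""

    def __init__(self, max_wait):
        self.max_wait = max_wait   # hypothetical tunable, in arbitrary ticks
        self.by_sector = []        # sorted list of (sector, arrival) tuples
        self.fifo = deque()        # the same requests in arrival order
        self.head = 0              # sector of the last dispatched request

    def add(self, sector, now):
        request = (sector, now)
        bisect.insort(self.by_sector, request)
        self.fifo.append(request)

    def dispatch(self, now):
        if not self.fifo:
            return None
        oldest = self.fifo[0]
        if now - oldest[1] >= self.max_wait:
            # The oldest request has waited too long: serve it first.
            request = oldest
        else:
            # Otherwise continue the sweep: take the first request at or
            # past the head, wrapping to the lowest sector if none remains.
            index = bisect.bisect_left(self.by_sector, (self.head, -1))
            request = self.by_sector[index % len(self.by_sector)]
        self.by_sector.remove(request)
        self.fifo.remove(request)
        self.head = request[0]
        return request[0]
```

With a generous max_wait the toy behaves like a pure sector sort; shrinking it trades some throughput for a bound on how long any one request can be passed over.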

Another decision made at the Kernel Developers’ Summit was that asynchronous I/O would be supported fully. In fact, all I/O operations within the kernel would be made asynchronous, with synchronous (“wait until it’s done”) operations implemented at higher levels. As we go to press, asynchronous I/O is well supported within the block layer — a feature that will make large database vendors very happy. Direct I/O (specifically those operations that go directly to or from user-space buffers without being copied through the kernel) is a natural companion to asynchronous I/O, and is supported well in 2.5.

The performance of the Linux virtual memory (VM) subsystem has disappointed users for many years. Not any more. The next stable kernel features a “world-class” VM that nobody need be ashamed of.


  • The new “reverse mapping” code makes it easier for the VM subsystem to find pages to swap out, and improves performance on memory-constrained systems.

  • Much better use is made of high memory, the memory that can’t be directly addressed on 32-bit systems.

  • Applications can request large pages, which greatly reduce page table overhead and the associated runtime cost.

  • The file writeout code, which is responsible for flushing buffered file contents back to disk, is now much smarter. It works with the new block layer for better performance, and it can make writeout decisions based on which disks are least busy at any given time. It’s claimed that the new code can keep sixty drives running at full speed.

  • Non-uniform memory access (NUMA) and discontiguous memory systems are better supported in 2.5 as well.
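To see why reverse mapping helps, consider this small Python model. It is a sketch under simplifying assumptions (ToyMemory and its methods are invented for this example, and real page tables are nothing like dictionaries). The point is that each physical frame remembers who maps it, so reclaiming a frame does not require scanning every process's page table.

```python
from collections import defaultdict


class ToyMemory:
    """Toy model of forward page tables plus a reverse map."""

    def __init__(self):
        self.page_tables = defaultdict(dict)   # pid -> {virtual page: frame}
        self.rmap = defaultdict(set)           # frame -> {(pid, virtual page)}

    def map_page(self, pid, vpage, frame):
        old = self.page_tables[pid].get(vpage)
        if old is not None:
            # Keep the reverse map consistent when a page is remapped.
            self.rmap[old].discard((pid, vpage))
        self.page_tables[pid][vpage] = frame
        self.rmap[frame].add((pid, vpage))

    def evict(self, frame):
        """Unmap a frame from every process that maps it, without scanning."""
        users = self.rmap.pop(frame, set())
        for pid, vpage in users:
            del self.page_tables[pid][vpage]
        return sorted(users)
```

The cost is the extra bookkeeping on every mapping change, which is the usual trade-off with a reverse index.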

The 2.5 kernel also features a new scheduler which makes decisions in constant time, regardless of the number of processes running on the system. The new kernel is also much smarter about not moving processes between CPUs on symmetric multi-processor (SMP) systems, which is an important performance booster. Threading has also been vastly improved, to the point that a relatively modest Linux system can run 100,000 simultaneous threads and still be usable.
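One way to make a scheduling decision in constant time is to keep a separate queue for each priority level, along with a bitmap recording which queues are non-empty. The Python sketch below illustrates that general technique only; the class, the number of priority levels, and the convention that lower numbers run first are assumptions for the example, not a description of the kernel's actual data structures.

```python
from collections import deque

NUM_PRIORITIES = 8  # hypothetical; lower number means higher priority


class ToyRunQueue:
    """Toy run queue: one FIFO per priority plus a bitmap of busy levels."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]
        self.bitmap = 0   # bit p is set while queues[p] is non-empty

    def enqueue(self, task, priority):
        self.queues[priority].append(task)
        self.bitmap |= 1 << priority

    def pick_next(self):
        if not self.bitmap:
            return None
        # Isolate the lowest set bit. The cost depends on NUM_PRIORITIES
        # only, not on how many tasks are waiting.
        priority = (self.bitmap & -self.bitmap).bit_length() - 1
        queue = self.queues[priority]
        task = queue.popleft()
        if not queue:
            self.bitmap &= ~(1 << priority)
        return task
```

Whether ten tasks or ten thousand are queued, pick_next does the same small amount of work.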

Linux has had top-tier networking performance for years, and the 2.5 kernel adds a few improvements. The most significant is the integration of the NAPI (“new API”) code that keeps the kernel from being overrun with device interrupts from high-speed adapters. TCP segment offloading improves performance by handing off some work to smart network cards. Asynchronous I/O support is being merged into the network stack as well.

With all of this work focused on large systems, one might wonder how the new kernel performs on the smaller systems that most Linux users have. Benchmarking has shown some performance regressions for some smaller systems. The kernel developers are well aware of the problems, and intend to address them during the 2.5 stabilization phase. In the end, 2.5 should be a performance winner for small systems as well.

Filesystems

The Linux virtual filesystem (VFS) code has continued to evolve in the 2.5 kernel. The new namespace feature allows every process to have its own view of how the filesystem is laid out. Several new filesystems have also been merged into the kernel, including ext3 (a journaling version of the classic Linux ext2 filesystem), JFS (a journaling filesystem from IBM), XFS (a high performance, journaling filesystem from SGI), and NFSv4 (the current version of the venerable Network Filesystem). Reiser4, a completely rewritten version of ReiserFS, is waiting in the wings and may yet be merged into 2.5. The ext2 and ext3 filesystems have been enhanced with directory indices, making operations on large directories faster.

The 2.5 kernel features a completely revamped disk quota subsystem. Quotas, and their associated files, are now handled by pluggable modules, making it easy for the kernel to support foreign quota systems. A new extended attribute structure has also been merged, paving the way for support of access control lists and other extended file metadata.

Internally, the VFS code continues to evolve. Most of the changes are not visible to anybody outside of the VFS subsystem, but the result is faster, cleaner, and safer code.

Finally, the kernel image can, itself, contain a filesystem image appended to the executable code. The initramfs code mounts this image at boot time, and delegates much of the boot-time work the kernel used to perform to programs found in the image. Tasks such as obtaining a network address with DHCP, finding the root filesystem, or mounting a remote NFS volume can now be handled with initramfs images, and need not be done by the kernel. It’s expected that more specialized applications, such as embedded systems, will be able to make good use of initramfs.

Devices

Every new kernel release supports more devices. The biggest change in 2.5, perhaps, is the addition of USB 2.0 support.

Many device-level subsystems have also been reworked in 2.5. The next stable kernel will ship with a much-improved IDE subsystem, despite the lengthy false start there. Much of the code has been cleaned up, new chipset drivers have been added, and the IDE code now supports tagged command queueing (TCQ). With TCQ, the kernel can pass several I/O requests to a device in parallel and let the device figure out the optimal processing order.

The long-awaited rewrite of the SCSI layer has not gotten as far as some had hoped. A certain amount of clean up has happened, and some functions are being moved up into the block layer. But the SCSI code, for the most part, remains as it was, warts and all. In the end, the SCSI layer is not as bad as some people make it out to be — it works for a lot of very demanding users.

The Advanced Linux Sound Architecture (ALSA) has, after years of development, been merged into the 2.5 kernel. ALSA completely replaces the old Open Sound System drivers with a new system that is better designed, more capable, and inherently compatible with a wider range of hardware. ALSA is particularly well suited to professional audio users who want to get the most out of high-end hardware.

Then there is the new device model. The Linux kernel has never had any sort of single, coherent registry of the devices attached to the system. For example, if you ask a 2.4 system what devices it has, it will be unable to answer. This lack of a registry worked fine as long as the system administrator knew what devices to expect and where to look for them, so it went unchanged for a long time.

But power management and hot plugging have exacerbated the problem. All modern systems, servers included, are expected to handle power management tasks. The power bill for a big room full of servers is far lower, for example, if those servers can turn off idle subsystems during slow times. But that sort of power management requires an understanding of how the system is put together. You can’t power down a USB hub before you have dealt with all of the devices connected to that hub.

The device model introduces a new device structure that represents any sort of attached device. These device structures can be connected together into a number of hierarchical data structures that reflect the physical topology of the system.

With this structure in place, the system knows enough about its own configuration to understand how all of its devices relate to each other. Each entry in the device hierarchy includes useful information, such as which driver is responsible for the device. It also includes pointers to the device’s power management functions. As a result, suspending the system (for example) is just a matter of walking the tree and calling each device’s suspend function. Power management remains a complicated task in general, but the job has been made easier by the new device model structure.
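The tree walk can be pictured with a short Python model. Everything here is hypothetical (the Device class, its fields, and the device names are invented for the example), but it shows the ordering that matters: children are suspended before the bus or hub they hang from.

```python
class Device:
    """Toy stand-in for an entry in the device hierarchy."""

    def __init__(self, name, driver=None):
        self.name = name
        self.driver = driver      # which driver is responsible, if any
        self.children = []
        self.suspended = False

    def add_child(self, child):
        self.children.append(child)
        return child

    def suspend(self, log):
        # A real driver would quiesce its hardware here.
        self.suspended = True
        log.append(self.name)


def suspend_tree(device, log):
    """Suspend every child first, then the device itself."""
    for child in device.children:
        suspend_tree(child, log)
    device.suspend(log)
    return log


if __name__ == "__main__":
    root = Device("pci0")
    hub = root.add_child(Device("usb-hub"))
    hub.add_child(Device("mouse"))
    print(suspend_tree(root, []))
```

Resuming would walk the same tree in the opposite order, parents first.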

The device model code implements part of another fundamental, ongoing change in how the kernel views devices. Modern hardware tends to be hot pluggable, meaning that the device can be connected or disconnected without shutting down the system. The kernel is increasingly moving to a view that all devices are hot pluggable. So the device structure contains pointers to functions that can handle device probing, along with connect and disconnect events. As devices come and go, the device model core patches them into and out of the device hierarchy, connects drivers, and invokes the /sbin/hotplug script to do any necessary user space work.

The device core also exports a filesystem, which was long called driverfs. driverfs includes several directory trees that reflect the state of the system’s hardware. Among other things, it includes all the information needed for the /sbin/hotplug script to set up and tear down device names in /dev as devices come and go. The prevailing wisdom is that this mechanism will eventually displace the devfs filesystem, and many people like it because it moves all device naming and permissions policy out of the kernel and into user space. But the issue is far from settled, and you can expect a new round of devfs flame wars before it is all over. (As of the middle of November, new names are also being considered, and the final name is likely to be sysfs.)




In Memory of Leonard Zubkoff


August 2002 brought the sad news that Leonard Zubkoff had been killed in a helicopter crash in Alaska. Leonard is best remembered as the author of the BusLogic (later Mylex) SCSI driver. Anybody who was trying to build serious SCSI-based Linux systems in the mid-1990s chose the BusLogic card for its high level of support. Leonard contributed to numerous other drivers, the SCSI subsystem as a whole, and also contributed video card support to the XFree86 project. Old-timers may also recall his work on Lucid Emacs.


As Chief Technology Officer at VA Linux Systems, Leonard is credited with demonstrating the first four-processor Linux system. Leonard used his time at VA to push Linux onto ever higher-end hardware. He had much to do with the success of Linux on servers toward the end of the 1990s.


In recent years, Leonard had retired from much of his active work to spend more time doing things he loved, such as flying helicopters into beautiful places. Leonard is well remembered as a brilliant engineer and a genuinely nice person. He is greatly missed.

Kernel Internals

Those who look at the internal kernel API may feel a little lost in 2.5. Much has changed there, and many of the fundamental assumptions of kernel programming have changed. The driving forces behind these changes are many, but at the top of the list are code correctness, maintainability, and performance.

One of the first changes in 2.5 was turning the internal kdev_t type, which used to simply mirror the user dev_t type (a short integer), into a structure. This change had the effect of breaking almost every bit of kernel code that goes anywhere near devices. In typical fashion, Linus introduced the change, did the minimum to get his system working, then put out the (still badly broken) kernel for everybody else to fix. The job got done within a couple of weeks. kdev_t remains a short integer (in disguise), but that will almost certainly change before the end of the 2.5 series. There is a pressing need to support large numbers of devices, and kdev_t must be able to accommodate growth. This growth is likely to involve more large changes, including the removal of the venerable static block and char device tables.

Another fundamental change is the preemptible kernel. Since the beginning, Linux kernel code has not been preemptible: code running in kernel space would not be interrupted (except, briefly, by device interrupts) until it returned to user space or voluntarily yielded the processor. In 2.5, by contrast, kernel code can be preempted if a higher-priority task needs to run. The result is much quicker response time for tasks that need very low latency, such as streaming audio and video applications.

Quite a bit of effort has been spent preventing preemption in atomic sections, such as when the code holds a spinlock. And most of the concurrency issues were already raised by the introduction of SMP support. The preemptible kernel, nonetheless, presents a potential trap to kernel programmers who expect their code to run uninterrupted.

Task queues have been eliminated. The interface was ungainly, and there were a number of pitfalls with how task queues were implemented. In their place, kernel developers have tasklets (which were introduced in the 2.3 series) and workqueues. Tasklets are for high-performance tasks which will not sleep. A workqueue, on the other hand, has its own “worker thread,” and can thus handle longer-running tasks which might block.
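The workqueue idea, a dedicated thread that runs deferred jobs so that they are free to block, can be imitated in user space. The Python sketch below is only an analogy: ToyWorkqueue and its method names are invented for this example and do not mirror the kernel interface.

```python
import queue
import threading
import time


class ToyWorkqueue:
    """Toy user-space analogue of a workqueue with one worker thread."""

    def __init__(self):
        self._items = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            func, args = self._items.get()
            try:
                if func is None:      # sentinel: shut the worker down
                    return
                func(*args)           # may sleep; only this thread waits
            finally:
                self._items.task_done()

    def queue_work(self, func, *args):
        """Hand a job to the worker and return immediately."""
        self._items.put((func, args))

    def flush(self):
        """Block until every queued job has finished."""
        self._items.join()

    def destroy(self):
        self._items.put((None, ()))
        self._worker.join()


if __name__ == "__main__":
    results = []
    wq = ToyWorkqueue()
    wq.queue_work(time.sleep, 0.01)       # a job that blocks
    wq.queue_work(results.append, "done")
    wq.flush()
    print(results)
    wq.destroy()
```

The caller of queue_work never waits on the job itself, which is the property that makes this pattern suitable for work that might block.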

Interrupt processing has been changed; the classic cli() and sti() functions are gone. It is no longer possible to globally disable interrupts on a multiprocessor system. Code which used those functions is still being reworked to use proper locking and local interrupt disabling.

Much thought has gone into how to improve the module loading and unloading process, which remains fraught with dangerous race conditions. A new module loading scheme, which performs all of the work inside the kernel, has been proposed, but not merged as of this writing. There is also talk of simply disabling the unloading of kernel modules on the grounds that it never can be made entirely safe.




Life on the Stable Side


With Marcelo Tosatti as its maintainer, the 2.4 kernel has certainly been stable in at least one regard: there have only been two releases in all of 2002. 2.4.18 came out on February 25, and 2.4.19 was released on August 3. 2.4.20 is, as of this writing, still possible before the end of 2002.


Marcelo has merged hundreds of patches in the past year, while fiercely holding the line against anything that looks like new features or could lead to instability. The 2.4 releases have been long in coming, but they have been reliable, solid performers.


By far, the larger of the two releases was 2.4.19. Along with an incredible number of fixes, this kernel included a new and improved IDE subsystem. In fact, the time needed to ensure the IDE changes were heavily tested was one of the reasons that this release was long delayed. 2.4.19 also merged the last set of big patches to the virtual memory subsystem, which had been completely replaced in 2.4.10. The 2.4 VM was still problematic for some users until these patches went in.


The 2.4.20 kernel is likely to include the JFS journaling filesystem (for an in-depth, hands-on discussion of JFS, see the October 2002 issue, available online at http://www.linux-mag.com/2002-10/jfs_01.html) and quite a few fixes. Marcelo has promised that the 2.4.20 release cycle would be faster than 2.4.19.


The -ac patches to 2.4 are interesting since the kernels that most people actually use tend to be derived from Alan Cox’s patches. The -ac series has also been the gateway through which many patches have found their way into the 2.4 kernel. However, in recent months, 2.4-ac has become the testing platform for large IDE changes on their way into 2.5.

Other Notable Changes

The Linux Security Module (LSM) patch is being merged. LSM creates a large number of hooks where a loadable module can insert code to implement security policies. The LSM hooks are entirely restrictive: a hook can deny access that the kernel would have otherwise allowed, but it cannot enable access that would otherwise be denied. In this way, the LSM authors hope to avoid introducing new security problems themselves. SELinux, the secured Linux distribution created by the U.S. National Security Agency, has been rewritten to work with the LSM framework.
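The “restrictive only” rule is easy to state in code. In this Python sketch (ToyAccessControl, its methods, and the request format are all made up for illustration), the ordinary permission check runs first, and the registered hooks are consulted only to veto a request that it has already allowed.

```python
class ToyAccessControl:
    """Toy model of restrictive security hooks."""

    def __init__(self):
        self.hooks = []

    def register_hook(self, hook):
        """Add a callable that returns False to deny a request."""
        self.hooks.append(hook)

    def may_access(self, base_allowed, request):
        # The ordinary check comes first. Hooks are never asked to
        # rescue a request that it has already refused.
        if not base_allowed:
            return False
        return all(hook(request) for hook in self.hooks)


if __name__ == "__main__":
    ac = ToyAccessControl()
    ac.register_hook(lambda req: req["path"] != "/etc/shadow")
    print(ac.may_access(True, {"path": "/tmp/notes"}))
    print(ac.may_access(True, {"path": "/etc/shadow"}))
```

Because a hook's answer is only ever combined with a logical AND, a buggy or malicious module can lock things down too far but cannot open a door that was otherwise closed.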

One very late addition to the kernel was a new cryptographic API that supports a number of encryption and hashing algorithms. This marks the first time that cryptographic code has been integrated into the standard kernel (one can only hope that the legal environment remains friendly enough that the crypto capability can stay there). The first use for the new API is to support a brand-new IPSEC implementation. IPSEC was rewritten from scratch, rather than taken from the longstanding FreeS/WAN project. (FreeS/WAN imposes restrictions on who can contribute patches.) Before the 2.5 kernel is stabilized, it will probably include cryptographic filesystem support as well.

Another interesting development that’s been merged is User-mode Linux (UML), which enables building and running a Linux kernel as a user-mode process. UML is exceptionally useful for many kernel development tasks, since a user-mode kernel can be examined with standard debugging tools and does not take down the host system if it crashes. UML is also attracting interest as a way of running secure virtual systems. Imagine giving multiple users access to their own UML instance on a shared system. Those users can be given root access to their system, yet will be unable to escape from the UML “jail” and interfere with each other or the underlying host system. (For extra fun, a few deranged souls are working on porting UML to Windows.)

The 2.5 kernel adds support for some new architectures as well. 64-bit PowerPC support has been merged, and will be put to good use in “big iron” systems. The current record for a full kernel compilation is held by a 32-processor PPC64 system: it ran in 7.5 seconds. Also of interest is support for the x86-64 (AMD Hammer) architecture. Hammer systems, when they become available, will have top-quality Linux support, and should make a very nice desktop (or server) box.

2.5 also includes the “software suspend” patch. Even if your desktop system does not have a built-in suspend operation like laptops do, this patch can make the machine suspend itself to disk and restore its state when powered up again.

Looking Forward

What kernel changes should be expected in 2003? Predicting what the kernel developers will do that far in the future is rather like trying to predict next summer’s weather: one can guess that it will be generally hot, but not much more.

In this case, the big event of 2003 should be the release of 2.6, the next stable kernel series. Most of the year will probably be dedicated to stabilization of the 2.5 kernel, and that’s if the feature freeze holds. If the freeze proves as slushy as it has in previous development kernels, we may still be waiting for a stable release this time next year.

Remember that the stabilization period continues after the stable release happens. In the case of the 2.4 kernel, this period lasted almost a full year. There is much interest in having the next stable kernel go more smoothly, so, with luck, a truly stable kernel will emerge more quickly. Still, it is unlikely that the next development series will begin in 2003.




The Horoscope for Linux 2.7


If you’re an open source operating system born in the Year of the Penguin, 2004 promises to be a very good year.


Between now and then, you’ll continue to pump “big iron,” with many patches for large systems (and NUMA systems in particular) waiting for you to embrace in 2004.


While you’re anxious to improve your IDE features, enhancements and serial ATA support will have to wait until 2004. Patience, in this case, is a virtue. In the meantime, strive to manage your memory better: think about including the shared page table patch. It will make others very happy. The stars also suggest that you implement better multipath I/O support. Do that and your SCSI karma will be greatly improved.


For now, forget your grand schemes to eliminate static device numbers and extend the hotplug system to all devices. And, no matter what your friends say, you’re never too old to include a Linux Kernel Crash Dump subsystem. You’ll have better luck next time around.


Finally, expect your love life to heat up when you meet a lovely, reworked Enterprise Volume Management System based on LVM2.


Lucky you.



Jonathan Corbet is Executive Editor of LWN.net (http://www.lwn.net) and co-author of “Linux Device Drivers, Second Edition.” He lives in Boulder, CO, with his wife, two kids, and Jane the dog.