Programming Linux 2.6

Kernel 2.6 is finally here, and it touts several enhancements over the 2.4 series. The press has highlighted changes relevant to systems architects and managers, but there's plenty in 2.6 for application developers, too.

This month’s column provides an overview of some updates and new features in 2.6, including filesystem support, threading library changes, and the new kernel-level profiler. This article assumes you already have access to a machine running 2.6. For features that must be explicitly enabled, the kernel config option (such as CONFIG_PROFILING) is listed.

Asynchronous I/O

Asynchronous I/O (AIO) separates I/O operations from the calling function. Similar to running a shell command in the background, an application constructed with AIO can issue a series of long-running requests and immediately continue its other processing. Later on, the application can go back and check the results of those operations. AIO-enabled programs appear more responsive because the I/O operations occur independently of the application’s main event loop.

While it’s possible to realize asynchronous I/O with threads, the AIO calls do the work for you: you needn’t design your own framework to accept I/O requests and publish the results.

Once you’ve installed a recent build of the libaio library, you’re ready to AIO-enable your apps.

1. First, create a context using io_setup().

2. Associate a series of I/O calls with the context using io_submit().

3. Later, call io_getevents() to retrieve the status and results of context operations, or call io_cancel() to cancel them.

4. Finally, clean up the context using io_destroy().

Don’t forget to install the developer version of the AIO package, such as Fedora’s libaio-devel, to get the header files.
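
The four steps above can be sketched as a single asynchronous read. This is only an illustration under a few assumptions: the helper name aio_read_start() is made up for this example, and to stay self-contained it invokes the kernel entry points through syscall() with the structures from <linux/aio_abi.h> rather than going through libaio's wrappers of the same names.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>   /* aio_context_t, struct iocb, struct io_event */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: reads up to len bytes from the start of path.
   Returns the number of bytes read, or a negative value on failure. */
long aio_read_start(const char *path, char *buf, size_t len)
{
    aio_context_t ctx = 0;              /* must be zero before io_setup() */
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    long result = -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Step 1: create a context able to hold up to 8 pending requests. */
    if (syscall(SYS_io_setup, 8, &ctx) < 0) {
        close(fd);
        return -1;
    }

    /* Describe one read request at offset 0. */
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes     = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf        = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes     = len;
    cb.aio_offset     = 0;

    /* Step 2: submit the request; the call returns without waiting. */
    if (syscall(SYS_io_submit, ctx, 1, list) == 1) {
        /* ... the application is free to do other work here ... */

        /* Step 3: collect the completion event (blocks until one arrives). */
        if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
            result = (long)ev.res;      /* byte count, or negative errno */
    }

    /* Step 4: tear down the context. */
    syscall(SYS_io_destroy, ctx);
    close(fd);
    return result;
}
```

A program linked against libaio (-laio) would more likely call io_setup(), io_submit(), io_getevents(), and io_destroy() directly and use the library's own header.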

Filesystems and Synchronous Directories

The advanced capabilities of the Reiser Filesystem (ReiserFS), the Journaling Filesystem (JFS), and Silicon Graphics’ XFS filesystem are hardly new, but they were previously available only as kernel patches. By upgrading to 2.6, conservative shops can now take advantage of these filesystems without patching their kernels.

Applications that rely heavily on the filesystem sometimes need something stronger than the standard ext3. For example, an application that places thousands of files in a single directory would benefit from ReiserFS’s scalability. Another application that does a lot of file I/O would be more resilient to system crashes with JFS’s journaling capabilities.

One feature new to all filesystems is synchronous directories. With a slight performance penalty, changes made in a synchronous directory are committed to disk before control returns to the caller. To make a directory synchronous, run chattr +S /some/directory. To verify that the bit is set, use lsattr -d /some/directory.

To enable ReiserFS, JFS, and XFS in the kernel, look for the CONFIG_REISERFS_FS, CONFIG_JFS_FS, and CONFIG_XFS_FS options, respectively.

Access Control Lists

One deficiency of the traditional Unix permissions model is that it limits access control to a single user, a single group, and other (everyone who is neither the owner nor a member of the owning group). Sometimes, however, you want to grant access to several users who are unrelated (at the system level, at least). In 2.6, fine-grained permissions can be achieved with access control lists (ACLs).

For example, the following command grants read-write access to the file semi-private.txt to bob, yet read-only access to peggy:

$ setfacl -m u:bob:rw,u:peggy:r semi-private.txt

Here -m modifies the ACL, and u specifies that user (as opposed to group) attributes are being changed. Similar to chmod’s symbolic mode, r stands for read access and w for write access.

An overview of the ACL system is provided in the acl man page. C and C++ programs can alter ACLs using the acl_set_file() library function, which is part of libacl.

Extended Attributes

Extended attributes (EAs) are key/value pairs of metadata, or information about the file that’s not part of its contents. While Linux EAs are limited to plain text, you can apply them in any number of novel ways. For example, you could implement a last-modified-by attribute for shared files.

EAs can be managed from command-line tools as well as a native API. To set attributes, use the command setfattr or the system call setxattr(); to fetch them, use getfattr or getxattr().

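As an example, the following command and system call both set the user.pub_date attribute of the file article.txt to "June 2004" (attributes created by ordinary users live in the user namespace, hence the user. prefix):

$ setfattr -n user.pub_date -v "June 2004" article.txt

#include <string.h>     /* strlen() */
#include <sys/types.h>
#include <sys/xattr.h>  /* setxattr(); older libattr installs use <attr/xattr.h> */
const char* attrName = "user.pub_date" ;
const char* attrValue = "June 2004" ;
const char* file = "article.txt" ;
enum { NO_FLAGS = 0 } ;

/* returns 0 on success, -1 (with errno set) on failure */
setxattr( file, attrName, attrValue,
          strlen( attrValue ), NO_FLAGS );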
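
Reading an attribute back follows the same pattern with getxattr(). The sketch below assumes the attribute lives in the user namespace (for example user.pub_date) and that the value is short text; read_attr() is a hypothetical helper name, not part of any library.

```c
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical helper: copies the value of attribute `name` on `path`
   into buf and NUL-terminates it so it can be used as a C string.
   Returns the value's length, or -1 if the attribute cannot be read
   (missing file, missing attribute, no filesystem support, ...). */
ssize_t read_attr(const char *path, const char *name,
                  char *buf, size_t buflen)
{
    if (buflen == 0)
        return -1;

    /* Leave one byte free for the terminator; the stored value
       itself carries no trailing NUL. */
    ssize_t n = getxattr(path, name, buf, buflen - 1);
    if (n < 0)
        return -1;

    buf[n] = '\0';
    return n;
}
```

A caller might use it as read_attr("article.txt", "user.pub_date", value, sizeof value) and treat a return of -1 as "attribute not available".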

One caveat to ACLs and EAs: you must use tools that recognize ACLs and extended attributes (such as an updated version of tar or cp) or all “extended” information will be lost as you move files around.

Access Control Lists (ACLs) are available as module options for ext3, JFS, and XFS filesystems. Extended attributes are supported for ext2 and ext3.

The relevant kernel configuration options are CONFIG_EXT3_FS_POSIX_ACL, CONFIG_JFS_POSIX_ACL, and CONFIG_XFS_POSIX_ACL for ACLs, and CONFIG_EXT2_FS_XATTR and CONFIG_EXT3_FS_XATTR for extended attributes.

The epoll() System Calls

Graphical user interface (GUI) programs and some daemons use poll() to watch for activity on file descriptors. The new epoll system works like poll(), but is much more scalable: whereas poll() scans its entire list of file descriptors on every call to check for events, epoll registers callbacks on its file descriptors that fire when an update occurs.

To use the new polling system:

1. Create a special epoll file descriptor with epoll_create().

2. Use epoll_ctl() to add file descriptors to the watch list.

3. Call epoll_wait() to check for events on watched file descriptors.

4. Close the epoll file descriptor with the standard system call close().
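
As a rough sketch of those four steps, the function below watches the read end of a pipe and reports whether epoll sees the byte written to the other end. The pipe and the function name wait_for_pipe_data() are just illustration devices; a real program would register sockets or other long-lived descriptors.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: returns the number of ready descriptors reported
   by epoll_wait() (1 when the pipe has data), or -1 on error. */
int wait_for_pipe_data(int timeout_ms)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    /* Step 1: create the epoll descriptor (the argument is only a size hint,
       but it must be greater than zero). */
    int epfd = epoll_create(1);
    if (epfd < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    /* Step 2: ask to be told when the read end becomes readable. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    int ready = -1;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        write(pipefd[1], "x", 1) == 1) {
        /* Step 3: wait for events on the watched descriptors. */
        struct epoll_event out[4];
        ready = epoll_wait(epfd, out, 4, timeout_ms);
        if (ready == 1 && out[0].data.fd != pipefd[0])
            ready = -1;             /* an event for an unexpected descriptor */
    }

    /* Step 4: the epoll descriptor is closed like any other. */
    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return ready;
}
```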

OProfile and the System-Wide Profiler

Profiling is the first step in performance tuning: it shows where a program burns CPU cycles. Traditional profiling requires that you rebuild your program, so that the compiler can insert hooks into the object files. Those new binaries then generate data from which profilers (such as gprof) extract trace statistics.

The method works, but has flaws: even when rebuilding is an option, detailed traces require debug symbols (enabled with the compiler’s -g flag) that can often conflict with other compiler optimizations. So, all in all, you never profile the real, production program.

The new 2.6 kernel exposes a system-wide profiling interface that doesn’t require intrusive recompiles. It also supports profiling the kernel itself, and the system as a whole. In turn, the OProfile toolkit (http://oprofile.sourceforge.net) pulls in trace data via this kernel interface.

OProfile’s opcontrol configures and controls the profiler; opreport fetches profile data and can pull system-wide statistics or analyze a single program; and opgprof generates an input file readable by gprof. The OProfile web site provides additional documentation.

The system-wide profiling feature is enabled using the CONFIG_PROFILING option.

CPU Affinity

By default, a process in a multi-processor machine typically bounces between several CPUs. In some cases, explicitly binding a process to certain CPUs, or assigning CPU affinity, may yield several benefits:

* INCREASED CPU CACHE HITS. In caching, a hit occurs when data is pulled from a cache instead of being copied anew from the original, slower source. CPU affinity increases a process’s cache hit ratio.

* IMPROVED PERFORMANCE ON NUMA SYSTEMS. In NUMA systems, one CPU can be “closer” to a piece of memory than another. Relatively slow bus speeds may make it more efficient to use the local, burdened processor instead of a remote, idle processor.

* CONTAINING AN UNRULY PROCESS. A resource-intensive process can be limited to select processors, leaving the rest free for other tasks.

Linux 2.6 achieves CPU affinity with the system calls sched_getaffinity() and sched_setaffinity():

#include <sched.h>
int sched_setaffinity(pid_t pid, unsigned int len,
                      unsigned long *mask);
int sched_getaffinity(pid_t pid, unsigned int len,
                      unsigned long *mask);

Specifying 0 as the pid argument gets or sets the affinity for the current process. len is the size, in bytes, of the bitmask that mask points to (typically sizeof(unsigned long)). The mask argument is a series of bits that represent the system’s processors, where a set bit indicates the process may use that CPU. Therefore, unsetting all but one bit limits the process to that single CPU.
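
Here is a small sketch of limiting the current process to a single CPU. It assumes a current glibc, which declares these calls with a size_t length and a cpu_set_t mask (manipulated through the CPU_ZERO, CPU_SET, CPU_ISSET, and CPU_COUNT macros) rather than a bare unsigned long; pin_to_one_cpu() is a made-up name for this example.

```c
#define _GNU_SOURCE          /* exposes cpu_set_t and the CPU_* macros */
#include <sched.h>

/* Hypothetical helper: pins the calling process to the first CPU it is
   currently allowed to run on. Returns the number of CPUs left in its
   affinity mask (1 on success), or -1 on error. */
int pin_to_one_cpu(void)
{
    cpu_set_t allowed, wanted, check;

    /* pid 0 means "the calling process". */
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;

    /* Build a mask with exactly one bit set. */
    CPU_ZERO(&wanted);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_SET(cpu, &wanted);
            break;
        }
    }

    if (sched_setaffinity(0, sizeof wanted, &wanted) != 0)
        return -1;

    /* Read the mask back to confirm what the kernel accepted. */
    if (sched_getaffinity(0, sizeof check, &check) != 0)
        return -1;
    return CPU_COUNT(&check);
}
```

Starting from the process's existing mask, instead of assuming CPU 0, keeps the sketch usable on machines where some CPUs are already off limits to the process.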

Great New Threads

The new kernel also brings several thread-related changes. For one, the kernel itself is preemptive: some kernel-space operations can be interrupted to yield to user processes. This is especially relevant for GUI applications, which require maximum responsiveness.

Second, the kernel is based on a 1:1 model, in which a kernel thread is available for each user thread. The internal O(1) scheduler lets the kernel efficiently handle a greater number of threads than previous versions, so this doesn’t burden the system. Better still, thread creation and tear down are both faster and less costly.

Kernel 2.6 includes support for the Native POSIX Thread Library (NPTL). Among other benefits, the enhanced POSIX compliance improves signal handling. For example, it’s possible to send a signal (such as SIGSTOP) to an entire multi-threaded process.

However, migration of existing code to NPTL isn’t automatic: you’ll have to rebuild your application to take advantage of its features. Several new thread functions are available, and some underlying library changes may wreak havoc on old code. For example, all threads in a process report the same process ID (PID).
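
The shared-PID behavior is easy to observe. The following sketch (function names are hypothetical) starts one thread, records the value of getpid() as seen from inside it, and compares that with the main thread's PID. Depending on the C library version, it may need to be built with the compiler's -pthread option.

```c
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Thread body: store the PID as seen from inside the new thread. */
static void *report_pid(void *arg)
{
    *(pid_t *)arg = getpid();
    return NULL;
}

/* Hypothetical check: returns 1 if a new thread reports the same PID
   as the main thread, 0 if it differs, -1 if the thread could not run. */
int thread_shares_pid(void)
{
    pthread_t tid;
    pid_t seen = -1;

    if (pthread_create(&tid, NULL, report_pid, &seen) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;

    return seen == getpid();
}
```

Code that used the PID to tell threads apart is one example of the old assumptions that no longer hold under NPTL.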

In spite of the backward binary compatibility, some older, non-NPTL code can still get confused running on a newer system. You can disable NPTL on a per-process basis by setting the environment variable LD_ASSUME_KERNEL to a previous kernel revision (say, 2.4.1 or 2.2.5).

Seqlocks and Futexes

Still on the topic of threads, 2.6 brings seqlocks and futexes.

Seqlocks fill a very specific niche: they protect shared access to non-pointer variables in sections of frequently-called code. To use a seqlock, wrap the data to be protected in calls to write_seqlock() and write_sequnlock(). For example, using the age-old example of updating a shared counter, you’d write:

#include <linux/seqlock.h>

seqlock_t lock;
seqlock_init( &lock );
int counter = 0;

write_seqlock( &lock );
counter++;   /* the protected update goes between the two calls */
write_sequnlock( &lock );
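
The kernel API above can't be exercised outside the kernel, so here is a user-space sketch of the idea behind it, built on C11 atomics. It is not the kernel's implementation: the type and function names are invented for this example, and it assumes a single writer (the kernel's seqlock also serializes writers). The writer bumps a sequence number before and after each update; a reader retries whenever the number was odd or changed while it was reading.

```c
#include <stdatomic.h>

/* Hypothetical user-space stand-in for a seqlock-protected pair. */
struct seq_pair {
    atomic_uint seq;     /* odd while a write is in progress */
    atomic_int  x, y;    /* the protected data */
};

/* Writer side (single writer assumed). */
void pair_write(struct seq_pair *p, int x, int y)
{
    atomic_fetch_add(&p->seq, 1);    /* now odd: write begins */
    atomic_store(&p->x, x);
    atomic_store(&p->y, y);
    atomic_fetch_add(&p->seq, 1);    /* now even: write complete */
}

/* Reader side: never blocks the writer, but may have to retry. */
void pair_read(struct seq_pair *p, int *x, int *y)
{
    unsigned before, after;

    do {
        before = atomic_load(&p->seq);
        *x = atomic_load(&p->x);
        *y = atomic_load(&p->y);
        after = atomic_load(&p->seq);
        /* Retry if a write was in progress or completed meanwhile. */
    } while ((before & 1u) || before != after);
}
```

The trade-off is the same one the kernel makes: writers are never delayed by readers, at the cost of readers occasionally repeating their work.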

A futex (or fast user-space mutex) is a synchronization primitive that heads to kernel space only to resolve contention. To prevent contention in the first place, it supports setting priorities on waiting threads. Other synchronization methods, such as semaphores and mutexes, are built on futexes.

The documentation explains that futexes aren’t for everyday development, but the API is available for anyone who wishes to explore (perhaps to create a new method of synchronization).
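
For the curious, the raw interface looks roughly like this. The C library provides no futex() wrapper, so the sketch goes through syscall(); the names futex_wait() and futex_wake() are chosen for this example only. The key property is that FUTEX_WAIT puts the caller to sleep only if the futex word still holds the expected value, so the uncontended case never has to block in the kernel.

```c
#define _GNU_SOURCE
#include <errno.h>           /* callers check errno after a -1 return */
#include <linux/futex.h>     /* FUTEX_WAIT, FUTEX_WAKE */
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: sleeps until woken, but only if *word still
   equals `expected`. If the value has already changed, returns -1
   immediately with errno set to EAGAIN. */
long futex_wait(uint32_t *word, uint32_t expected)
{
    return syscall(SYS_futex, word, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Hypothetical wrapper: wakes at most `count` threads waiting on
   `word` and returns how many were actually woken. */
long futex_wake(uint32_t *word, int count)
{
    return syscall(SYS_futex, word, FUTEX_WAKE, count, NULL, NULL, 0);
}
```

A lock built on these calls would first try to take the lock with an atomic operation in user space and fall back to futex_wait() only when that attempt fails.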

Core File Naming

Core dumps enhance the debugging process, from early development to production deployment. Whereas previous kernels created files of the format core or core.pid, 2.6 supports dynamic naming of core files based on printf()-style modifiers.

For example, %p represents the PID, and %h is the hostname. You can provide as much (or as little) detail as you want. Use sysctl to set the kernel.core_pattern variable. For instance, this command names core files for the hostname, process ID, and the process owner’s numeric user ID (%u):

# sysctl -w kernel.core_pattern="core.%h-%p-%u"

/proc and /sys

If you’ve written tools based on the contents of the /proc directory, your code may be due for an update. As of 2.6, there are new entries in the /proc/pid/status and /proc/pid/stat files. The format of /proc/meminfo has also changed.

In addition to /proc and /dev/pts, kernel 2.6 introduces a third pseudo-mount called /sys. Where /proc contains information about running processes and kernel stats, /sys represents the machine’s hardware tree. (Some of /proc’s hardware-related trees also exist under /sys now.)

For example, to determine whether the disk device sda is online, you could read the file /sys/block/sda/device/online.
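
Because most /sys entries are small text files holding a single value, reading one from a program takes only a few lines. The helper below is a sketch with a made-up name; it works on any one-line text file and assumes nothing about which entries exist on a given machine.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads the first line of a /sys-style attribute
   file into buf, without the trailing newline.
   Returns 0 on success, -1 if the file can't be opened or is empty. */
int read_attr_file(const char *path, char *buf, size_t buflen)
{
    if (buflen < 2)
        return -1;

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char *line = fgets(buf, (int)buflen, f);
    fclose(f);
    if (line == NULL)
        return -1;

    buf[strcspn(buf, "\n")] = '\0';   /* strip the newline, if any */
    return 0;
}
```

Called with the path from the example above, it would leave the entry's contents in buf as a string, ready to compare or convert.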

To mount the /sys filesystem, run mount -t sysfs none /sys or add the proper line to /etc/fstab to make this permanent.

/sys is of interest to people writing hardware-related tools, similar to the procps suite that interfaces with the /proc directory. It’s also closely tied to the kobject interface, which is relevant to writers of hardware kernel modules.

Loadable Modules

There are several changes to the kernel module system. First of all, there is a cosmetic change: kernel objects now have the extension .ko instead of .o.

Building third-party modules is more uniform, and folded into the kernel build process in a framework-like fashion. Simply drop your code into place and the build system takes care of adding the appropriate flags and such. With a proper makefile, an external module can be built with a simple command, such as:

$ make -C /path/to/kernel/source SUBDIRS=/path/to/module/source modules

Inside the code itself, the MODULE_LICENSE macro serves a twofold purpose: writers of third-party modules can identify themselves and their module’s license, and the running kernel can identify modules released under a GPL-compatible license.

Related to MODULE_LICENSE is EXPORT_SYMBOL_GPL(), which limits access of the current module’s exported symbols to other GPL-friendly modules. The kernel prevents non-GPL modules from accessing this data.

kexec: Linux Within Linux

When booting Linux on an x86 machine, the BIOS probes for hardware and passes control to the kernel. A patch for kernel 2.6 provides the kexec family of system calls, which permit the kernel loaded by the BIOS to load another kernel.

The ability to start Linux from Linux opens up new realms of possibilities, from faster reboots, to crash recovery, to booting the main kernel from devices not supported by most x86 BIOSes (after they’ve been probed by the first kernel). If you’ve worked with commercial Unix hardware (say, Sun’s or HP’s), you’ll recognize that this last feature is sorely lacking on x86 machines.

Use of kexec requires the userspace kexec-tools suite (http://www.xmission.com/~ebiederm/files/kexec/) and a kernel patch (http://developer.osdl.org/rddunlap/kexec/).

Something for Everyone

The new Linux kernel has something for everybody: end-users, system administrators, and even application developers. If you’ve been holding out for the official kernel release to start updating your apps, your wait is over.

Ethan McCallum is a freelance technology consultant. He can be reached at ethanqm@penguinmail.com. You can read more about ACLs in “Halt! Who Goes There?” in the September 2003 issue of Linux Magazine, available online at http://www.linux-mag.com/2003-09/acls_01.html.
