Concurrency in the Kernel

Multiple threads of execution need to be synchronized to avoid data corruption and even system freezes.
As the Linux kernel has grown in complexity to support Symmetric Multi-Processing (SMP) and kernel preemption, more and more scenarios generate multiple threads of execution. Because threads can simultaneously operate on shared kernel data structures, access to such data structures has to be serialized. In this column, let’s learn the basics of protecting shared kernel resources from concurrent access, starting with a simple example and slowly introducing complexities like interrupts, kernel preemption, and SMP.

Spinlocks and Semaphores

A code area that accesses shared resources is called a critical section. Spinlocks and semaphores are the two mechanisms used to protect critical sections in the kernel.
A spinlock ensures that only a single thread enters a critical section at any time. Any other thread that wants to enter the critical section must wait “spinning its wheels” until the first thread exits. Listing One shows a basic use of a spinlock.
Listing One: Basic spinlock usage

#include <linux/spinlock.h>

/* Initialize */
spinlock_t mylock = SPIN_LOCK_UNLOCKED;

/* Try to acquire the spinlock. This is inexpensive if there
* is no one inside the critical section. In the face of contention,
* spinlock() busy waits.
spin_lock (&mylock);

/* … Critical Section … */

/* Release the lock */
spin_unlock (&mylock);

In contrast to spinlocks, which put threads into a spin if they attempt to enter a busy critical section, a semaphore puts each contending thread to sleep until its turn arrives to occupy the critical section.
Because it’s bad to consume processor cycles to spin, semaphores are more suitable than spinlocks to protect critical sections when the estimated wait time is long. In semaphore terms, anything more than two context switches can be considered long, since semaphores have to switch out the contending thread to sleep, and switch it back in when it’s time to wake it up.
Thus, in many cases, it’s easy to make a decision on whether to use a spinlock or a semaphore:
*If your critical section needs to sleep, you have no choice but to use a semaphore. It’s a deadly sin to schedule, preempt, or sleep on a wait queue after acquiring a spinlock.
*Since semaphores put the calling thread to sleep in the face of contention, you have no choice but to use spinlocks inside interrupt handlers.
Unlike spinlocks, which allow only a single thread inside a critical section at a time, semaphores can be configured to allow a predetermined number of threads into the critical section simultaneously. (However, semaphores that permit only a single thread at a time are more common in the kernel.) A semaphore that permits only a single thread at a time is called a mutex. Listing Two shows basic mutex usage.
Listing Two: Basic Mutex Usage

/* Architecture dependent */
#include <asm/semaphore.h>

/* Statically declare a mutex. To dynamically create
* a mutex, use init_MUTEX().
static DECLARE_MUTEX (mysem);

/* Try to acquire the semaphore. This is inexpensive if there
* is no one inside the critical section. In the face of contention,
* down() puts the calling thread to sleep.
down (&mysem);

/* … Critical Section … */

/* Release the semaphore */
up (&mysem);

To illustrate the use of locks, let’s start with a critical section that is present only in process context and gradually introduce complexities in the following order:
1.Critical section present only in process context on a uniprocessor box running a non-preemptible kernel (P-UP-N). This is the simplest case and needs no locking.
2.Critical section present in process and interrupt contexts on a uniprocessor machine running a non-preemptible kernel (PI-UP-N).
3.Critical section present in process and interrupt contexts on a uniprocessor machine running a preemptible kernel (PI-UP-P).
4.Critical section present in process and interrupt contexts on an SMP machine running a preemptible kernel (PI-SMP-P).

Case Two: PI-UP-N

In the case of a critical section present only in process context on a uniprocessor box running a non-preemptible kernel case, you only need to disable interrupts to protect the critical region:
cli ();  /* Disable Interrupts */
/* ... Critical Region ... */
sti (); /* Enable Interrupts */
However, if interrupts are already disabled when execution reaches cli(), sti() has the unpleasant side effect of re-enabling interrupts, rather than restoring interrupt state. This can be fixed with:
unsigned long flags;

/* Point A: Disable Interrupts */
save_flags (flags);
cli ();
/* ... Critical Region ... */
/* Restore state to what it was at Point A above */
restore_flags (flags);
This latter code works correctly, regardless of the interrupt state when
execution reaches cli().

Case Three: PI-UP-P

If preemption is enabled, mere disabling of interrupts won’t protect your critical region from being trampled. There is the possibility of multiple threads simultaneously using the critical section in process context. The solution, apparently, is to disable kernel preemption before the start of the critical section and re-enable it at the end. Spinlocks do that internally if CONFIG_PREEMPT is enabled:
/* Marker to turn OFF preemption */
spin_lock (&mylock);
/* ... Critical Region ... */
/* Marker to turn ON preemption */
spin_unlock (&mylock);
However, this still doesn’t prevent interrupt handlers from stomping through your critical section. Instead, use the IRQ variant of spinlocks:
unsigned long flags;

* Point A:
* Save interrupt state.
* Disable interrupts and preemption.
spin_lock_irqsave (&mylock, flags);
/* ... Critical Region ... */
* Enable preemption. Restore interrupt state to that
* at Point A.
spin_unlock_irqrestore (&mylock, flags);
Preemption state need not be restored to what it was at Point A, since the kernel does that for you via a variable called the preemption counter. The counter gets incremented whenever preemption is disabled (using the preempt_disable() function), and gets decremented whenever preemption is enabled (using the preempt_enable() function). Preemption kicks in only if the counter value is zero.

Case Three: PI-SMP-P

Now assume that the critical section executes on an SMP machine and that your kernel has been configured with CONFIG_SMP and CONFIG_PREEMPT turned on.
In the scenarios discussed so far, the spinlock primitives have done little other than enable and disable preemption and interrupts. The actual lock features have been compiled away.
In the presence of SMP, the actual locking logic gets compiled in, and the spinlock primitives are rendered SMP-safe. The SMP-enabled semantics is:
unsigned long flags;

* Point A:
* Save interrupt state on the local CPU
* Disable interrupts on the local CPU
* Disable preemption
* Lock the section to regulate access by other CPUs
spin_lock_irqsave (&mylock, flags);
/* ... Critical Region ... */
* Enable preemption. Restore interrupt state to what
* it was at Point A for the local CPU.
* Release the lock.
spin_unlock_irqrestore (&mylock, flags);
In the SMP case, only interrupts in the local CPU are disabled when the lock is acquired. If interrupts are not disabled in the local CPU, a deadlock might arise if an interrupt is generated while execution is in the middle of a critical section. (The interrupt handler sits in a tight spin, waiting for the lock to become available, but the critical section cannot complete until the interrupt handler itself finishes execution.) However, this danger of deadlock does not arise for interrupts generated on other CPUs, since an interrupt handler executing on processor A doesn’t prevent process context code on processor B from executing. The interrupt handler on processor A spins, waiting until processor B exits the critical section.
The kernel has specialized locking primitives in its repertoire that help improve performance under specific conditions. Using a mutual exclusion scheme tailored to your needs will make your code more powerful. Let’s take a look at some of the specialized exclusion mechanisms.

Atomic Operators

Atomic operators are used to perform lightweight operations like bumping counters, conditional increments, and setting bit positions, in one shot. Atomic operations are guaranteed to be serialized and do not need locks for protection against concurrent access. The implementation of atomic operators is architecture-dependent.
For example, to check whether there are any remaining data references before freeing a kernel network buffer (called an skbuff), the skb_release_data() routine defined in net/core/skbuff.c does the following:
if (!skb->cloned ||
/* Atomically decrement and check the reference value */
atomic_dec_and_test(&(skb_shinfo(skb)->dataref))) {
/* ... */
While skb_release_data() is thus executing, another thread using the skbuff_clone() function (defined in the same file), might be simultaneously incrementing the data reference counter:
/* .. */

/* Atomically bump the reference count */

/* .. */
The use of atomic operators protects the data reference counter from being trampled by these two threads. It also eliminates the hassle of using locks to protect a single integer variable from concurrent access.
The kernel also supports operators like set_bit(), clear_bit(), and test_and_set_bit(), to atomically manipulate a single bit.

Reader-Write Locks

The reader-writer variant of a spinlock is another specialized concurrency regulation mechanism. If the usage of a critical section is such that separate threads either read or write, but don’t do both, a reader-writer lock is a natural fit.
Multiple reader threads are allowed inside a critical region simultaneously. Reader spinlocks are defined as:
rwlock_t myrwlock = RW_LOCK_UNLOCKED;

/* Acquire reader lock */
read_lock (&myrwlock);

/* ... Critical Region ... */

/* Release lock */
read_unlock (&myrwlock);
However, if a writer thread enters a critical section, other reader or writer threads are not allowed inside. To use writer spinlocks, you’d write:
rwlock_t myrwlock = RW_LOCK_UNLOCKED;

/* Acquire writer lock */
write_lock (&myrwlock);

/* ... Critical Region ... */

/* Release lock */
write_unlock (&myrwlock);
Look at the Internetwork Packet Exchange (IPX) routing code in net/ipx/ipx_route.c for a real life example of reader-writer spinlock usage. A reader-writer lock called ipx_routes_lock protects the IPX routing table from simultaneous access. Threads that need to look-up the routing table to forward packets don’t have to write to the table and can request reader locks. However, threads that need to add or delete entries from the routing table have to acquire writer locks. This improves performance since there will usually be far more instances of routing table look-ups than routing table updates.
Like regular spinlocks, reader-writer locks also have corresponding irq versions, namely, read_lock_irqsave(), read_lock_irqrestore(), write_lock_irqsave(), and write_lock_irqrestore(). The semantics of these functions are similar to that discussed for regular spinlocks.
Corresponding reader-writer flavors of semaphores are, down_read(), down_write(), up_read(), and up_write().
Sequence locks or seqlocks, introduced in the 2.6 kernel, are reader-writer locks where writers are favored over readers. Writer threads do not wait for readers who may be inside a critical section. Because of this, reader threads may discover that their critical section operation has failed, and may need to retry:
do {
seq = read_seqbegin (&mylock);
/* ... Critical Section */
} while (read_seqtry (&mylock, seq));
Writers protect critical regions using write_seqlock() and write_sequnlock().
The 2.6 kernel introduced another mechanism called Read-Copy Update (RCU) that can yield improved performance in cases where readers far outnumber writers. The basic idea is that reader threads can execute without locking. Writer threads are more complex: each performs update operations on a copy of the data structure and replaces the pointer that readers will see. The original copy is maintained until the next context switch to ensure completion of all ongoing read operations. (Be aware that using RCU is more involved than using the primitives discussed thus far, and should be used only if you are sure that it’s the right tool for the job. There’s ample documentation in the kernel tree in Documentation/RCU/*.)
For an example on using RCU, look at fs/dcache.c. In Linux, each file is associated with directory entry information (stored in a structure called dentry), meta data information (stored in an inode), and actual data (stored in data blocks). Each time you operate on a file, the components in the file path are traversed and corresponding dentries are created. The dentries are kept cached in a data structure called the dcache to speed up future operations. At any time, the number of dcache lookups will be much more than dcache updates, so references to the dcache are protected using RCU primitives.


Concurrency-related problems are generally hard to debug since they are usually difficult to reproduce. It’s a good idea to enable SMP (CONFIG_SMP) and preemption (CONFIG_PREEMPT) while compiling and testing your code, even if your production kernel is going to run on a UP machine with preemption disabled. There is a configuration option under Kernel Hacking called Spinlock Debugging (CONFIG_DEBUG_SPINLOCK) that can help you catch some common spinlock errors. Also, tools like lockmeter (http://oss.sgi.com/projects/lockmeter/) can be used to collect lock-related statistics.
The most common concurrency problem occurs when you forget to lock an access to a shared resource. This results in different threads “racing” through that access in an unregulated manner. The problem (called a race condition) may appear in the form of occasional strange code behavior.
Another potential problem arises when you miss releasing held locks in certain code paths, resulting in deadlocks. To get a hang of this, consider the following example:
/* Acquire lock */
spin_lock (&mylock);

/* ... Critical Section ... */

/* Assume that error is very rare
* I forgot to release the lock!
if (error) {
return -EIO;

/* Release lock */
spin_unlock (&mylock);
After the occurrence of the error condition, any thread trying to acquire mylock gets deadlocked, and the kernel may freeze.
If the problem first manifests months or years after you write the code, it’s that much harder to go back and debug it. To avoid such unpleasant encounters, concurrency logic should be designed systematically at the beginning, when you architect your software.

Sreekrishnan Venkateswaran has been working for IBM India since 1996. His recent Linux projects include putting Linux onto a wristwatch, a PDA, and a pacemaker programmer. You can reach Krishnan at class="emailaddress">krishhna@gmail.com.

Comments are closed.