In last month’s article, we looked at writing a basic Linux SCSI driver — one that basically sucked. Actually, this driver was worse than planned because it contained a bug which anyone running on an SMP box would have found pretty quickly. In this month’s article I’ll flesh out the basic SCSI driver a bit, and then cover two additional topics: the kernel execution environment and writing “portable” drivers.
Fixing the SMP Bug
[Figure: Clean Separation: The Linux request queue.]
The (intentional) bug in the code from last month’s article relates to the io_request_lock spinlock in Linux kernel version 2.2. First, a bit of background: Linux maintains a clean separation between filesystem code (such as ext2fs and iso9660) and the block device code (such as disks and CD-ROMs) upon which filesystems rely.
Filesystems add entries to a fixed-length request queue in the kernel. These requests contain information on blocks to be read or written (such as location and size of buffers). The request queue code sorts the requests so that disk head movement will be minimized; this is important as disk head movement is one of the slowest components of disk I/O. The request queue may also merge multiple requests together, such as turning four separate requests for sequential 1KB blocks into a single 4KB request. Once this reordering has been done, the request queue dispatches requests to the underlying block device drivers.
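To make the merging step concrete, here is a rough userspace sketch. It is not the kernel's request queue code: struct blk_req and merge_sorted are names made up for illustration, and the requests are assumed to be sorted by starting block already.

```c
#include <stddef.h>

/* Hypothetical, simplified request: a run of blocks starting at `start`. */
struct blk_req {
    unsigned long start;   /* first block number */
    unsigned long count;   /* number of consecutive blocks */
};

/*
 * Merge requests that are already sorted by start block. A request that
 * begins exactly where the previous run ends is folded into that run.
 * Works in place and returns the new number of requests.
 */
size_t merge_sorted(struct blk_req *q, size_t n)
{
    size_t out = 0, i;

    if (n == 0)
        return 0;
    for (i = 1; i < n; i++) {
        if (q[out].start + q[out].count == q[i].start)
            q[out].count += q[i].count;   /* extends the current run */
        else
            q[++out] = q[i];              /* gap: start a new run */
    }
    return out + 1;
}
```

Given four single-block requests for blocks 10 to 13 followed by a two-block request at block 40, this leaves two entries: one four-block run and the untouched request at 40.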
It is quite important, especially on an SMP machine, that two pieces of code do not mess with the request queue at the same time. If they do, you’ll see fun console messages such as “request queue destroyed” and the machine will probably hang. To prevent this, the queue is protected by a lock that must be claimed when code wants to fiddle with it.
To make life easier, block drivers are called with the request queue lock already held. If it has to do any serious work not involving the queue, a driver can improve performance and increase concurrency if it drops the lock; this allows other threads to play with the request queue while it’s not being used by the driver.
There are two problems with our original driver. First, it’s not being “polite” by dropping the lock as soon as it can. Second, the interrupt handler fails to claim the lock when a SCSI command has completed.
Listing One shows an abridged version of the original code for our driver’s interrupt handler.
Listing One: Original IRQ Handler Code
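    void my_irq_handler(int irq, void *dev_id, struct pt_regs *regs)
    {
        struct Scsi_Host *shpnt = dev_id;
        int io = shpnt->io_port;
        unsigned short data = inw(io + READ_STATUS);
        if(data & RESET_DONE) {
            current_command->result = DID_RESET << 16;
            /* No lock is held here: this is the bug */
            current_command->scsi_done(current_command);
            return;
        }
        /* … */
    }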
What is not immediately obvious is that calling current_command->scsi_done not only informs the SCSI layer that a request is finished, but also places the entry back into the request queue and wakes up those tasks waiting for the I/O to complete.
To claim the request queue lock before calling the scsi_done handler, we change Listing One to read:
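    if(data & RESET_DONE) {
        current_command->result = DID_RESET << 16;
        spin_lock(&io_request_lock);
        current_command->scsi_done(current_command);
        spin_unlock(&io_request_lock);
        return;
    }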
spin_lock is used to claim the lock as we are in an interrupt handler at this point. Non-interrupt handler code (e.g., for a polled device) would need to use:
    spin_lock_irq(&io_request_lock);
    current_command->scsi_done(current_command);
    spin_unlock_irq(&io_request_lock);
spin_lock_irq and spin_unlock_irq disable and re-enable local CPU interrupts, respectively, to prevent deadlock. Within an interrupt handler, interrupts are already disabled, so there’s no need to do this.
So, the bug has been fixed in our driver, but it’s still pretty grotty. If our SCSI card is equivalently dumb, as some are, then there is nothing else we can do. SCSI cards are rarely this dumb, however, unless supplied free with scanners.
There are two main ways of improving SCSI performance: issue more commands at a time or do more work per command. Ideally, you want to do both.
In order to get more work done per command, we can ask the SCSI layer to cluster commands together and to use so-called “scatter/gather” lists to describe buffers in memory. Being able to use scatter/gather is device-dependent but almost any good device supports it.
Our simple driver is given commands that specify a size and a buffer location in which to place data. A large buffer in user memory is often not physically contiguous, so by specifying a single buffer there is no way to merge writes or reads involving multiple pages of memory.
By informing the kernel SCSI layer that our driver supports scatter/gather, the SCSI layer can merge more reads and writes. Instead of a single buffer address and length, we receive a list of (address, length) values. We can either load this list into the SCSI controller (if it’s a smart controller), or implement scatter/gather in the driver ourselves.
Our original code for receiving the data block from the device was:
    len = current_command->request_bufflen;
    /* insw() takes a count of 16-bit words, not bytes */
    insw(port + DATA_FIFO, current_command->request_buffer, len >> 1);
To support scatter/gather mode, we need to expand the insw() call to cope with multiple buffer fragments. The SCSI layer sets command->use_sg to 0 for non-scatter/gather commands, and to the number of list elements in the scatter/gather case. If the command uses scatter/gather, command->request_buffer will point to an array of struct scatterlist entries, each holding an (address, length) pair. Our new code can be seen in Listing Two.
Listing Two: Addition of Scatter/Gather Code to the Interrupt Handler
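    if(current_command->use_sg == 0) {
        insw(port + DATA_FIFO, current_command->request_buffer,
             current_command->request_bufflen >> 1);
    } else {
        int i = current_command->use_sg;
        struct scatterlist *sg = current_command->request_buffer;
        len = 0;
        while(i--) {
            len += sg->length;
            /* insw() counts 16-bit words, so halve the byte length */
            insw(port + DATA_FIFO, sg->address, sg->length >> 1);
            sg++;
        }
    }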
So, if we are passed a scatter/gather list, we walk down the list, copying each fragment from the device until we have finished. Now, instead of eight 4K I/O requests, we will get a single 32K request that is split into eight scatter/gather entries.
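The same walk can be tried outside the kernel. The sketch below is a userspace model only: struct sg_entry stands in for the kernel's scatter/gather entry, and struct fake_fifo with fifo_read stands in for the device FIFO. All three names are invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for a scatter/gather entry: one (address, length) pair. */
struct sg_entry {
    char *address;
    unsigned int length;
};

/* Stand-in for the device FIFO: reading consumes bytes from a flat source. */
struct fake_fifo {
    const char *data;
    size_t pos;
};

static void fifo_read(struct fake_fifo *f, char *dst, unsigned int len)
{
    memcpy(dst, f->data + f->pos, len);
    f->pos += len;
}

/* Fill each fragment in order; returns the total number of bytes moved. */
unsigned int sg_fill(struct fake_fifo *f, struct sg_entry *sg, int nents)
{
    unsigned int total = 0;

    while (nents--) {
        fifo_read(f, sg->address, sg->length);
        total += sg->length;
        sg++;
    }
    return total;
}
```

The point to notice is that the device side sees one continuous stream, while the memory side is a series of unrelated buffers; the list is the only thing tying the two together.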
Finally, we need to update the MYSCSI template to indicate that our driver supports scatter/gather. We change the sg_tablesize entry to read:
    sg_tablesize: SG_ALL,
to indicate that any kind of scatter/gather will do. Rather than using SG_ALL, you can specify a limit on the number of blocks which can be merged into a single scatter/gather list. Most controllers that support bus-mastering DMA (i.e., good ones) can only handle a limited scatter/gather list. Setting the sg_tablesize value to the card limit ensures the kernel will never pass a request beyond the controller’s ability to cope. This limit varies quite a bit: for example, it is 127 for the newer Symbios Logic controllers, but only eight for the Adaptec 1542 ISA bus controllers.
Queuing More Commands
Most SCSI controllers and devices can handle multiple commands at a time. It is fairly hard to provide code to illustrate this as the functionality is very card-dependent. To enable multiple commands, set the cmds_per_lun value in the MYSCSI driver template, like so:
    cmds_per_lun: 3,
Here, 3 is an example value specifying that three simultaneous commands can be handled by each device hanging off of this controller. The driver must then figure out which device needs servicing and which command it is dealing with, although a smart controller can do all of this work for you.
Queuing multiple commands is a huge performance win even if only two or three are queued at once. It means that a command is almost always queued, so that the disk has something to do while the kernel is figuring out what command to issue next. The drive and the PC are now processing I/O requests in parallel.
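How a driver keeps track of several outstanding commands is card-dependent, but the bookkeeping often boils down to a small table of slots per device. The following userspace sketch shows one possible shape; struct lun_queue, lun_issue, and lun_complete are hypothetical names, not part of the SCSI layer.

```c
#include <stddef.h>

#define CMDS_PER_LUN 3   /* mirrors the example cmds_per_lun value */

/* Hypothetical per-device bookkeeping: which slots hold an active command. */
struct lun_queue {
    void *cmd[CMDS_PER_LUN];   /* NULL means the slot is free */
};

/* Claim a free slot for cmd; returns the slot index (tag), or -1 if full. */
int lun_issue(struct lun_queue *q, void *cmd)
{
    int tag;

    for (tag = 0; tag < CMDS_PER_LUN; tag++) {
        if (q->cmd[tag] == NULL) {
            q->cmd[tag] = cmd;
            return tag;
        }
    }
    return -1;
}

/* On completion: hand back the command for this tag and free the slot. */
void *lun_complete(struct lun_queue *q, int tag)
{
    void *cmd;

    if (tag < 0 || tag >= CMDS_PER_LUN)
        return NULL;
    cmd = q->cmd[tag];
    q->cmd[tag] = NULL;
    return cmd;
}
```

When the card raises an interrupt and reports which tag finished, the handler would look up the matching command with something like lun_complete and then call its scsi_done routine.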
The Kernel Execution Environment
Linux kernel modules such as SCSI drivers execute in “supervisor mode” on the CPU. The precise meaning of supervisor mode depends on the CPU architecture. On an Intel x86, it means that kernel code executes in “ring 0,” which is the most privileged CPU mode.
The practical implications of supervisor mode are simply stated. You can write to the wrong memory location, you can write to the wrong I/O space, and you can do things like turn interrupts off and leave them disabled. In short, you are in a state where the processor trusts you. It is very important, therefore, that device drivers are carefully checked for correctness as well as thoroughly tested.
The Linux kernel itself restricts the facilities that a kernel module can use. For one, it inhibits the use of floating point arithmetic in kernel space. It is possible to use floating point, but the cost of doing this is measurable and it only works on machines with a real floating point chip. This is not for general use, however, and never allowable from an interrupt.
A second restriction is that the kernel runs with a stack that is a bit less than 8K long. Since chunks of that can be used for interrupt handling, you should be looking to keep stack usage under 2K by allocating memory objects from places other than the stack and by avoiding excessive recursion. There are very good reasons for not increasing the kernel stack size beyond 8K, the main one being to save memory. With one stack created per process, going to a 16K stack would cost 200 processes roughly an extra 1.5MB of memory (200 × 8K).
Finally, many kernel functions cannot be called within an interrupt handler. In particular, you cannot sleep in an interrupt, so any kind of semaphore or waiting for an event needs to occur elsewhere. An interrupt handler can put off handling an event or spin waiting for it. As spin waiting for an event locks the machine until the event occurs, this should be avoided if at all possible.
Memory allocation from an interrupt must be done with kmalloc using the GFP_ATOMIC flag. This tells the kernel not to wait for memory to become free if it can’t service the request immediately. Normally the kernel will swap pages of memory to make room for a new allocation, but this is not possible in an interrupt handler, so a GFP_ATOMIC request will simply return NULL if not enough memory is there.
While the kernel is processing an interrupt, it will process no other interrupts or events of any kind. Interrupt handlers should therefore be as short as possible. A mechanism called “bottom half handlers” (or “bh handlers”) exists to execute code after an interrupt but before returning to user space. This allows the bottom-half handler to do hard work while still allowing other interrupts to occur. The bottom halves do a great deal of work in Linux, including much of the networking receive code, all the timer handling, and much of the terminal driver.
All the restrictions of an interrupt handler apply to a bottom half handler, except that interrupts are enabled. To simplify life, bottom half handlers make what are called atomicity guarantees: You are guaranteed that a bottom half handler will not be executed again while it is already running, and you are guaranteed that two bottom halves will never run at the same time. However, this latter guarantee should not be exploited as it may well cease to be true in future kernels.
A slow interrupt handler should use a bottom half handler. Listing Three is an example of queuing a bottom-half handler from an interrupt handler. This causes the task defined by my_task to be executed as soon as the interrupt handler completes. In this example, the function my_slow_handler ends up being called. Other interrupts (even another instance of my_irq) can occur while the bottom half is executing.
Listing Three: Queuing a Bottom Half Interrupt Handler
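    static struct tq_struct my_task;
    void my_irq(int irq, void *dev_id, struct pt_regs *regs)
    {
        struct my_device *dev = dev_id;
        dev->info = inb(dev->port); /* Clears IRQ too */
        my_task.routine = my_slow_handler;
        my_task.data = dev; /* argument passed to my_slow_handler */
        queue_task(&my_task, &tq_immediate);
        mark_bh(IMMEDIATE_BH);
    }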
There are two standard task queues that can be used with the queue_task routine. Using tq_immediate and doing a mark_bh for IMMEDIATE_BH ensures that the task is run immediately upon completion of the interrupt handler. Often, however, you want to handle interrupts immediately, perhaps saving some information from each interrupt, and do the aggregate processing in the bottom half. The tq_timer queue allows this; it causes the bottom half to be run on the next timer tick. The serial driver code uses this feature: bytes are collected from the hardware at every interrupt, but a block of received data is fed into the terminal driver layer of the kernel at every timer tick (normally every 1/100th of a second). This aggregation of work reduces the cost of each event and improves overall throughput at a small latency cost.
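The aggregation idea is easy to model in userspace. The sketch below is not the kernel's task queue implementation; struct task, queue_once, run_queue, and count_calls are made-up names. It shows the essential behaviour: queuing a task that is already pending does nothing, so several "interrupts" between two "ticks" result in a single run of the slow handler.

```c
#include <stddef.h>

/* Userspace stand-in for a task queue entry: a routine plus its argument. */
struct task {
    struct task *next;
    int queued;                 /* set while the task sits on a queue */
    void (*routine)(void *);
    void *data;
};

struct task_queue {
    struct task *head;
};

/* "Top half": queue the task unless it is already pending. */
void queue_once(struct task_queue *q, struct task *t)
{
    if (t->queued)
        return;
    t->queued = 1;
    t->next = q->head;
    q->head = t;
}

/* "Bottom half": detach the list, then run everything that was queued. */
void run_queue(struct task_queue *q)
{
    struct task *t = q->head;

    q->head = NULL;
    while (t) {
        struct task *next = t->next;

        t->queued = 0;          /* may be queued again from now on */
        t->routine(t->data);
        t = next;
    }
}

/* Example routine: counts how many times the slow handler actually ran. */
void count_calls(void *data)
{
    ++*(int *)data;
}
```

Calling queue_once three times and then run_queue once runs count_calls a single time, which is the trade the serial driver makes: slightly later delivery in exchange for far fewer trips through the expensive code.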
Writing Portable Drivers
Writing a portable driver would be far too easy if all machines were similar. So an army of very bright hardware engineers has invented a variety of clever ways to make life interesting for programmers. If you are writing a kernel driver that might be useful on more than one architecture (and there are very few cases where it will not be), you need to be aware of the traps the hardware people have set.
What Size is a Long Today?
One of the most obvious things to keep in mind is that the size of types like short and long depends on the compiler and the target architecture. Assuming a long is 32 bits wide will lead to problems on the UltraSPARC or Alpha machines, where it’s 64 bits. To simplify matters, the kernel defines a set of fixed-size data types:
    s8     signed 8-bit value
    s16    signed 16-bit value
    s32    signed 32-bit value
    s64    signed 64-bit value
    u8     unsigned 8-bit value
    u16    unsigned 16-bit value
    u32    unsigned 32-bit value
    u64    unsigned 64-bit value
The sizes of the standard C types on the common platforms are shown in Table 1. The MIPS and ARM processors specify char as an unsigned type as well, just to make life more interesting.
Table 1: Length of Standard Types
| Platform | char | short | int | long |
| Alpha | 8 bit | 16 bit | 32 bit | 64 bit |
| ARM | 8 bit | 16 bit | 32 bit | 32 bit |
| Intel | 8 bit | 16 bit | 32 bit | 32 bit |
| M68K | 8 bit | 16 bit | 32 bit | 32 bit |
| MIPS | 8 bit | 16 bit | 32 bit | 32 bit |
| SPARC | 8 bit | 16 bit | 32 bit | 32 bit |
| UltraSPARC | 8 bit | 16 bit | 32 bit | 64 bit |
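As a small illustration of why the fixed-size types matter, here is a userspace sketch. The typedefs stand in for the kernel's own u8/u16/u32/u64 definitions (in the kernel you would get them from the kernel headers, not from stdint.h), and struct my_descriptor is a hypothetical hardware descriptor.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's fixed-size types. */
typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;

/* Hypothetical hardware descriptor: each field has the same width everywhere. */
struct my_descriptor {
    u32 buffer_addr;
    u16 length;
    u8  flags;
    u8  status;
};

/* Bytes occupied by the fields, independent of what `long` happens to be. */
unsigned int descriptor_field_bytes(void)
{
    return sizeof(u32) + sizeof(u16) + sizeof(u8) + sizeof(u8);
}
```

Had buffer_addr been declared as unsigned long, the descriptor would have matched the hardware on Intel and silently stopped matching it on an Alpha.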
You must also avoid making assumptions about the alignments of types. The compiler will pack structures differently according to the processor. If the actual structure alignment and size matter, you should put any padding in explicitly and use the gcc extension __attribute__ ((packed)) to tell the compiler that it is not to pad out your structure. Use this with care. Some machines will run very slowly and use software-trapped bus errors to emulate unaligned reads and writes when you attempt to reference data that is not aligned properly.
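The effect of the packed attribute is easy to see with two otherwise identical structures. This is a userspace sketch for gcc-compatible compilers; the structure names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the compiler is free to insert padding after `type`. */
struct hdr_natural {
    uint8_t  type;
    uint32_t length;
};

/* Packed layout: fields sit back to back, as a wire or hardware format needs. */
struct hdr_packed {
    uint8_t  type;
    uint32_t length;
} __attribute__ ((packed));
```

The packed version is always five bytes with length at offset 1, which is exactly the kind of misaligned field the warning above is about. The natural version is typically padded so that length lands on a four-byte boundary, but how much padding you get depends on the processor.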
I have saved the greatest part of the type conspiracy until last: The infamous “endianness” affair. Currently all supported platforms are either big-endian (that is, the most significant byte of a word is stored first) or little-endian (least significant byte first). You can test the endianness in your driver using the defines __LITTLE_ENDIAN and __BIG_ENDIAN.
To make life easier, there is a set of macros that do conversions when necessary. These are cpu_to_le32, cpu_to_be32, be32_to_cpu, and le32_to_cpu. Similar sets of macros are present for 16 and 64 bits.
Using these macros avoids having to test the endianness via defines, and also generates good code as the macros expand to assembler inlines where appropriate. The functions that do nothing are macro definitions that are replaced with the original value, so there is no overhead at all in such cases.
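If you want to see what such a conversion amounts to, the userspace sketch below does the same job by hand. These helpers (put_le32, get_le32, get_be32) are illustrative names, not the kernel macros; they build values with shifts, so they give the same answer whatever the byte order of the machine running them.

```c
#include <stdint.h>

/* Store a CPU-order value as four little-endian bytes. */
void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read four little-endian bytes into a CPU-order value. */
uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read four big-endian bytes into a CPU-order value. */
uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

In a real driver you would use the kernel macros instead, since on a machine whose byte order already matches they cost nothing at all.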
Writing a driver that runs on all platforms is hard, especially if you can’t test it on all architectures. Realistically, you should not expect to succeed. However, by considering the issues of size and endianness, you make things much easier for whoever ends up porting your driver. And a little consideration is a Good Thing indeed.
Alan Cox is a well-known Linux hacker. He is currently working on the development of drivers, Linux/SGI porting, and modular sound. He can be reached at [email protected].