Porting Device Drivers to Linux 2.2: Part II

If you followed last issue's "Gearheads" column, all of your block and character devices should be running under Linux 2.2, albeit possibly with warnings about obsolete PCI interfaces. In this article, I will finish up with some of the smaller changes that may catch a driver author, cover networking, and then look at the new PCI layer.


Figure 1:

if(skb->protocol == htons(ETH_P_MYPROTO))
{
    /* The card requires we mask the addresses for this */
    u8 v;

    skb = skb_cow(skb, 0);
    if(skb == NULL)
        return 1; /* Whoops no memory */
    /* … */
}

I’ll start with the small stuff, since that is nice and easy. The first of
these is signal handling. Linux 2.2 has more signals as well as POSIX real-time
signal queues. This changes how driver code determines whether a process
has received a signal.

In Linux 2.0, drivers check for signals directly. This is done with code such
as:

if(current->signal & ~current->blocked)
    return -EINTR;

which ensures that pressing Ctrl-C
on the terminal will return the EINTR error code from
the device driver function.

Linux 2.2 replaces this with a function which hides the implementation of
signals, which also means that we can avoid changing drivers again in the future.
The above code now becomes:

if(signal_pending(current))
    return -EINTR;

which is much cleaner.
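
As a rough illustration of why hiding the test behind a function helps, here is a minimal user-space sketch. Everything prefixed model_ is a hypothetical stand-in rather than a kernel definition, and the two-word signal set is an assumption made purely for the example. The driver-style routine only asks whether an unblocked signal is pending, so the layout of the signal set can change without touching it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NSIG_WORDS 2  /* assumption: signals spread over two 32-bit words */

/* Hypothetical stand-in for a process's signal state. */
struct model_task {
    uint32_t pending[MODEL_NSIG_WORDS];
    uint32_t blocked[MODEL_NSIG_WORDS];
};

/* Plays the role of the 2.2 helper: true if any unblocked signal is pending. */
bool model_signal_pending(const struct model_task *t)
{
    int i;

    assert(t != NULL);
    for (i = 0; i < MODEL_NSIG_WORDS; i++)
        if (t->pending[i] & ~t->blocked[i])
            return true;
    return false;
}

/* Driver-style routine: it never looks inside the signal set itself. */
int model_driver_wait(const struct model_task *t)
{
    if (model_signal_pending(t))
        return -EINTR;
    return 0;
}
```

If the signal set later grows to more words, only model_signal_pending() has to change.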

The second, related issue is timeouts. Linux provides device drivers with
several ways to handle timeouts. The normal mechanism is to use the add_timer() and del_timer()
functions. It is also possible to sleep on a wait queue or reschedule with a
timeout.
In Linux 2.0, the code for rescheduling with a timeout (essentially, to cause
the process to sleep for a certain delay within the kernel) was:

current->state = TASK_INTERRUPTIBLE;
current->timeout = jiffies + MY_DELAY;
schedule();

In Linux 2.2, this becomes:

current->state = TASK_INTERRUPTIBLE;
schedule_timeout(MY_DELAY);

The same pattern is followed for sleeping on a wait queue. This is done quite
simply with:

interruptible_sleep_on_timeout(wait_queue, MY_DELAY);

These changes to the timeout functions are done to improve scheduler
performance. Instead of the scheduling code spending time managing processes that
are almost never running, the scheduler and the timer handling are now split.
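
The delay in the fragments above is a count of timer ticks (jiffies). The following user-space sketch shows one way to derive such a count from milliseconds, and contrasts the absolute expiry time stored in the 2.0 style with the relative delay passed in the 2.2 style. The model_ names and the tick rate of 100 per second are assumptions made for the example, not kernel definitions.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_HZ 100UL  /* assumption: 100 timer ticks per second */

/*
 * Convert milliseconds to ticks, rounding up so that a short non-zero
 * delay never collapses to zero ticks. This is the relative value a
 * 2.2-style call would be given.
 */
unsigned long model_ms_to_ticks(unsigned long ms)
{
    assert(ms <= (ULONG_MAX - 999UL) / MODEL_HZ);  /* guard against overflow */
    return (ms * MODEL_HZ + 999UL) / 1000UL;
}

/* 2.0 style: the caller computes and stores an absolute expiry time. */
unsigned long model_expiry(unsigned long now, unsigned long delay_ticks)
{
    return now + delay_ticks;
}
```

With these assumptions a 250 ms delay is 25 ticks, and a driver at tick 1000 would have stored an expiry of 1025 in the old style.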

Porting Network Interfaces

Network driver functions have changed between Linux 2.0 and Linux 2.2. The
actual routines have changed little in terms of functionality, but the calling
conventions have changed considerably due to the addition of extensive SMP
support.

The most obvious difference is that the functions for freeing buffers have
changed. In order to avoid memory-accounting errors (and to make the programmer’s
life easier) the network buffers now remember which resources they are using and
whether the buffer is in the sending or receiving path. This means that the
second FREE_READ or FREE_WRITE argument to dev_kfree_skb() is a thing of the
past.

Secondly, the buffers handed to a network driver belong solely to that
driver. This gives the driver (almost) total freedom to play with the sk_buff
structure which it is handed. It can’t change skb->data (which is shared) but
it can play with the rest of the object. It is best to avoid playing with the
data anyway, as this potentially requires a copy. However, you can use the
skb_cow() function to obtain a private copy of the buffer. If the buffer was
already private it simply hands back the buffer you gave it, taking virtually
no time to execute. Thus you might use something like Figure 1.

Another function provided to help device and protocol authors is
skb_realloc_headroom(). The kernel guarantees that when your driver is passed
a buffer, that buffer has at least dev->hard_header_len bytes for the hardware
headers. The ether_setup() function sets this to 14 (for example), which
leaves space for the Ethernet header to fit.

Sometimes you get low-speed drivers that occasionally need a lot of header
space, and it’s undesirable to allocate the entire header space all of the time.
In these cases you can use:

skb = skb_realloc_headroom(skb, 128);
if(skb == NULL)
    /* Whoops no memory */
    return 0;

to make a copy of the buffer if need be, which has at least 128 bytes of
space at the beginning. In general, you want to avoid this function as copies
impact performance. In some cases, such as tunnel devices, you can never be sure
how much header space is needed in advance (e.g., as a tunnel may itself be
tunneled). In these cases you set the device header length to cover normal
cases and bite the overhead on the occasional unusual frame by using
skb_realloc_headroom().
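
The "copy only if need be" idea can be sketched in user space as follows. The struct and the model_ functions are hypothetical and much simpler than the kernel's sk_buff code. In particular, the ownership rule used here (the original is freed when a copy is made) is a choice made for the sketch, not a statement about how skb_realloc_headroom() behaves.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a network buffer; not struct sk_buff. */
struct model_buf {
    unsigned char *head;  /* start of the allocation */
    unsigned char *data;  /* start of the packet data */
    size_t len;           /* bytes of packet data */
};

/* Spare bytes available in front of the packet data. */
size_t model_headroom(const struct model_buf *b)
{
    return (size_t)(b->data - b->head);
}

/* Allocate a buffer with `headroom` spare bytes in front of `len` data bytes. */
struct model_buf *model_alloc(size_t headroom, const void *payload, size_t len)
{
    struct model_buf *b = malloc(sizeof(*b));

    if (b == NULL)
        return NULL;
    b->head = malloc(headroom + len);
    if (b->head == NULL) {
        free(b);
        return NULL;
    }
    b->data = b->head + headroom;
    b->len = len;
    if (len > 0)
        memcpy(b->data, payload, len);
    return b;
}

void model_free(struct model_buf *b)
{
    if (b != NULL) {
        free(b->head);
        free(b);
    }
}

/*
 * Return a buffer with at least `need` bytes of headroom. The original is
 * handed back untouched when it already has enough; otherwise a copy is
 * made and the original is freed. On allocation failure NULL is returned
 * and the original is left intact.
 */
struct model_buf *model_ensure_headroom(struct model_buf *b, size_t need)
{
    struct model_buf *copy;

    assert(b != NULL);
    if (model_headroom(b) >= need)
        return b;  /* cheap path: no copy */
    copy = model_alloc(need, b->data, b->len);
    if (copy != NULL)
        model_free(b);
    return copy;
}
```

A buffer allocated with 14 bytes of headroom passes through unchanged when 14 bytes are requested, and is copied only when something larger, such as 128 bytes, is needed.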

Locking and SMP

The actual receive and transmit paths of most drivers are unchanged between
2.0 and 2.2. There are real changes in the handling of ARP and headers but these
are invisible to most drivers as they use the standard setup functions.

The interaction between receives and transmits has changed considerably. In
Linux 2.0, the SMP lock ensured that the transmit and receive paths never ran in
parallel. A receive interrupt might well occur during a transmit, but the
opposite was never true.

In 2.2, it is quite likely on a multiprocessor machine that the transmit and
receive paths will run at the same time. While this is good for performance, it
does mean that driver authors may need to manage locks explicitly. Modern
Ethernet controllers are sometimes designed to make this easy, but not
always.

Most drivers that need to do some locking use spin locks (discussed in the
previous issue). The simple changes applied to most drivers are:


* Adding a spin lock to the device private structure, as in Figure 2A.

* Initializing the lock when the device is probed (in device_probe).

* Setting the lock in the transmit function, as in Figure 2B.

* Using the lock in the interrupt handler, as in Figure 2C.

Figure 2A: The 3c509 driver in 2.2 has the following structure:

struct el3_private {
    struct enet_statistics stats;
    struct device *next_dev;
    spinlock_t lock; /* The device lock */
    int head, size;
    struct sk_buff *queue[SKB_QUEUE_SIZE];
    char mca_slot;
};

Figure 2B:

if (test_and_set_bit(0, (void*)&dev->tbusy) != 0)
    printk("%s: Transmitter access conflict.\n", dev->name);
else {
    spin_lock_irqsave(&lp->lock, flags);

    /* Transmit code */

    spin_unlock_irqrestore(&lp->lock, flags);
}

Figure 2C:

lp = (struct el3_private *)dev->priv;
spin_lock(&lp->lock);

if (dev->interrupt)
    printk("%s: Re-entering the interrupt handler.\n", dev->name);
dev->interrupt = 1;
/* … */

The above usage of locks enforces single threading between the transmit and
receive paths if required by the device. If you can avoid such locking, it’s best
to do so, especially on devices capable of full-duplex networking. Avoiding locks
means you can be simultaneously sending data on one processor and receiving data
on another.
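
The test-and-set guard at the top of Figure 2B can be modelled in user space with a C11 atomic flag. This is only a sketch: struct model_dev and the model_ functions are hypothetical, and the atomic flag stands in for the driver's busy bit rather than for a kernel spin lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device state. */
struct model_dev {
    atomic_flag tbusy;          /* set while a transmit is in progress */
    unsigned long tx_packets;
};

/* Try to claim the transmitter; true means the caller now owns it. */
bool model_tx_claim(struct model_dev *d)
{
    assert(d != NULL);
    return !atomic_flag_test_and_set(&d->tbusy);
}

/* Give the transmitter back. */
void model_tx_release(struct model_dev *d)
{
    assert(d != NULL);
    atomic_flag_clear(&d->tbusy);
}

/* Returns 0 if the packet was accepted, 1 if the transmitter was busy. */
int model_start_xmit(struct model_dev *d)
{
    if (!model_tx_claim(d))
        return 1;               /* "Transmitter access conflict" */
    d->tx_packets++;            /* the transmit code would go here */
    model_tx_release(d);
    return 0;
}
```

A device declared as `struct model_dev d = { ATOMIC_FLAG_INIT, 0 };` accepts a packet when idle and refuses one while another caller holds the flag.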

Two other functions which have not changed in themselves are, however,
tangled up in the locking. The first is the get_stats()
function. This is called whenever a user asks for statistics on the device — for
example, through ifconfig or the /proc/
file. It is quite common for the statistics function to
query the card itself — often the card maintains the counters rather than the
driver. Therefore, get_stats() may need to be locked
against the transmit and receive paths to prevent conflicts.


Figure 3:

static struct enet_statistics *el3_get_stats(struct device *dev)
{
    struct el3_private *lp = (struct el3_private *)dev->priv;
    unsigned long flags;

    spin_lock_irqsave(&lp->lock, flags);
    update_stats(dev);
    spin_unlock_irqrestore(&lp->lock, flags);
    return &lp->stats;
}


The example in Figure 3 is from the 3c509 driver where the statistics query
cannot be done during a transmit or receive. Here, you can see the statistics
update function is guarded by the device spin lock ensuring that all three of the
statistics, transmit, and receive paths are serialized — that is, only one of
the three is executed at any given time.


Figure 4:

static void set_multicast_list(struct device *dev)
{
    unsigned long flags;
    struct el3_private *lp = (struct el3_private *)dev->priv;
    int ioaddr = dev->base_addr;

    spin_lock_irqsave(&lp->lock, flags);
    if (dev->flags&IFF_PROMISC) {
        outw(SetRxFilter | RxStation | RxMulticast |
             RxBroadcast | RxProm, ioaddr + EL3_CMD);
    }
    else if (dev->mc_count || (dev->flags&IFF_ALLMULTI)) {
        outw(SetRxFilter | RxStation | RxMulticast | RxBroadcast,
             ioaddr + EL3_CMD);
    }
    else
        outw(SetRxFilter | RxStation | RxBroadcast, ioaddr + EL3_CMD);
    spin_unlock_irqrestore(&lp->lock, flags);
}

The final function that tends to get involved with SMP locking is the
multicast list update. This can be called from both a user process updating its
multicast listening list and also from the IPv6 network layer. On some cards,
updating the multicast list requires you to stop transmit and receive, and
perhaps prevent statistics querying. Again, a spin lock can be used to ensure
this. The example in Figure 4 is from the 3c509 driver as well.

By now you are probably thinking that the kernel is out to get you. It does,
however, provide a set of sensible guarantees to eliminate most headaches:

* An interrupt handler will not be re-entered while running. This means you
will not get two processors trying to receive packets at the same time.

* The sending function is single threaded. The kernel will not pass you any
packets to send while you are executing your packet transmission function. It
will wait for you to return and then feed you the next packet if you are ready
for it.

Nevertheless, you do need to be aware of the fact that on a four-processor
machine you may be running a get_stats, a multicast
update, a receive and a transmit at the same time. The locks suggested should get
your driver working. Optimizing it beyond that really needs an SMP machine and a
lot of testing.

If your driver uses the common core drivers for things like the NS8390
(8390.o), the core driver modules handle SMP locking. In the case of the 8390
driver this is very good news for driver authors as the chip was not designed for
SMP use. In fact, at times it appears to have been designed to prevent SMP use,
mostly due to its age!

Header Handling

Header caches and ARP handling have changed significantly since Linux 2.0.
ARP is a protocol used by many networking layers to discover other IP hosts.
Physical networks such as Ethernet use their own addressing scheme and it is thus
necessary to map an IP address to an Ethernet address before sending any packets.
ARP solves this through the simple approach of broadcasting messages such as,
“Whoever has this IP address, own up and tell me the Ethernet address for it”.
The results are then cached by the kernel, and the cache can be inspected from
user space.
Because most drivers use an existing protocol layer for their physical
headers, the header cache and ARP changes are not issues for most driver authors.
The Ethernet, FDDI and token ring setup functions (init_etherdev(),
etc.) already cover the changes.

If you do need to touch these layers, all you probably need to know is that
while the build_header() function behaves as it did in Linux 2.0, the
rebuild() function has changed. Previously this function was handed a whole
series of mostly unnecessary parameters. Now it is handed only the buffer.
This makes sense because the other fields you need are the device and the
data pointers, which can be obtained from the buffer itself.


Figure 5A:

int eth_rebuild_header(void *buff, struct device *dev, unsigned long dst,
                       struct sk_buff *skb)
{
    struct ethhdr *eth = (struct ethhdr *)buff;
    /* … */
}

Figure 5B:

int eth_rebuild_header(struct sk_buff *skb)
{
    struct ethhdr *eth = (struct ethhdr *)skb->data;
    struct device *dev = skb->dev;
    /* … */
}

Thus, Figure 5A becomes like Figure 5B, which takes the other parameters from
the packet itself.

If you look at the Ethernet layer as a good example (net/ethernet/eth.c) you
will see that the kernel ARP functions have also been cleaned up in the same
way. The only arguments now passed around are:

arp_find(u8 *where, struct sk_buff *skb);

where skb is the buffer we are trying to complete an
ARP query for, and where is the place within that buffer
to put the answer.

Final Cleanup

The last small piece that has changed with network drivers is the statistics
structure. Previously called struct enet_statistics, this
structure is now called struct net_device_stats to
reflect its more generic nature. Using the old name is fine for now, but that may
break in 2.3.

Also, the stats structure now contains byte counts, so you will want to go
over your driver and add code to update tx_bytes and rx_bytes when you update
tx_packets and rx_packets. These extra byte counters are needed for accurate
SNMP network monitoring of Linux boxes.
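
A minimal sketch of that bookkeeping is shown below. The four counter names are the ones mentioned above. The cut-down struct and the model_ helper functions are hypothetical and exist only so the example builds in user space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the statistics structure, for illustration only. */
struct model_stats {
    unsigned long rx_packets;
    unsigned long tx_packets;
    unsigned long rx_bytes;
    unsigned long tx_bytes;
};

/* Account for one transmitted frame of `len` bytes. */
void model_account_tx(struct model_stats *s, size_t len)
{
    assert(s != NULL);
    s->tx_packets++;
    s->tx_bytes += len;
}

/* Account for one received frame of `len` bytes. */
void model_account_rx(struct model_stats *s, size_t len)
{
    assert(s != NULL);
    s->rx_packets++;
    s->rx_bytes += len;
}
```

Keeping the packet and byte updates side by side in one helper makes it harder to bump one counter and forget the other.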

The Linux 2.2 PCI Layer

Now it’s time to look at tidying up the PCI usage in 2.2 drivers.

The PCI code in 2.2 changed for a good reason. In Linux 2.0, PCI basically
meant x86, or to a limited extent, Alpha. Only the Intel x86 has the PCI BIOS
interface provided by the kernel. With Linux 2.2, you can be using PCI devices on
numerous platforms, including some bus layouts that the PCI BIOS does not
support.

Therefore, the kernel provides an abstract PCI layer that is built on top of
architecture-dependent code. On the x86 this includes both direct PCI and PCI
BIOS access. On other platforms (such as the PowerPC) this is done by talking to
the boot ROMs and directly to the PCI bus.

Linux 2.2 builds a list of PCI devices at boot time. Each entry is a struct pci_dev, which contains PCI configuration information
about the device.

The PCI bus functions take a struct pci_dev pointer as
an argument, enabling strange bus architectures to be hidden from the device
driver. To a driver, a PCI device is almost a platform-independent object.

Under Linux 2.0 a program using PCI would use

if(!pcibios_present())
    return -ENODEV;

to check if the PCI services existed. On Linux 2.2, this becomes

if(!pci_present())
    return -ENODEV;

After this, you scan the bus looking for your device. A PCI device has a vendor
and device identifier that are unique for each different type of card.


Figure 6A:

unsigned char bus, devfn;
int index = 0;

while(!(pcibios_find_device(MY_PCI_VENDOR, MY_PCI_DEVICE, index++,
                            &bus, &devfn)))
{
    /* Check this device */
    /* … */
}

Figure 6B:

struct pci_dev *pdev = NULL;

while((pdev = pci_find_device(MY_PCI_VENDOR, MY_PCI_DEVICE, pdev)) != NULL)
{
    /* Check this device */
    /* … */
}

Under Linux 2.0, drivers would use something like Figure 6A to walk
systematically through all matching cards. In Linux 2.2 the code is very similar
to that in Figure 6B.

As you can see, Linux 2.0 uses a counter to walk through the device list and
refers to devices by their bus and device-function identifiers, which is how PCI
is addressed at the device level. Linux 2.2 uses the struct pci_dev
instead, which hides all sorts of mysteries and sins that
may be in the underlying PCI architecture.

The initial assignment of

struct pci_dev *pdev = NULL;

is done because NULL means “start from the
beginning” when passed as the third argument to pci_find_device().
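
The "NULL means start from the beginning" convention is easy to model in user space. In this sketch the list node, the model_ functions and the ID values used to exercise them are all hypothetical. It only shows the cursor-style iteration, not how the kernel stores its device list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; the real struct pci_dev holds far more. */
struct model_pci_dev {
    unsigned short vendor;
    unsigned short device;
    struct model_pci_dev *next;
};

/*
 * Find the next device matching vendor/device. `from` is the previous
 * match, or NULL to start from the head of the list.
 */
struct model_pci_dev *model_find_device(struct model_pci_dev *list,
                                        unsigned short vendor,
                                        unsigned short device,
                                        struct model_pci_dev *from)
{
    struct model_pci_dev *p;

    assert(list != NULL || from == NULL);  /* no cursor into an empty list */
    p = (from != NULL) ? from->next : list;
    for (; p != NULL; p = p->next)
        if (p->vendor == vendor && p->device == device)
            return p;
    return NULL;
}

/* Walk every match, using the same loop shape as a driver probe. */
int model_count_devices(struct model_pci_dev *list,
                        unsigned short vendor, unsigned short device)
{
    struct model_pci_dev *p = NULL;
    int n = 0;

    while ((p = model_find_device(list, vendor, device, p)) != NULL)
        n++;
    return n;
}
```

Because the cursor is the previous match itself, the caller needs no separate index variable.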

Once you have a handle on your PCI device you have access to its memory, I/O
and IRQ assignment. In PCI, these are encoded in what are known as the Base
Address Registers (BAR registers). Each of these may be used to hold either an
I/O or memory address as well as its properties.


Figure 7:

struct pci_dev *pdev;
/* … */
membase = (pdev->base_address[0] & PCI_BASE_ADDRESS_MEM_MASK);
mydev->mem = ioremap(membase, MY_DEVICE_SIZE);

If your card manual says “base address register 0 specifies the memory
address for the card”, you would use something like Figure 7.

For I/O space (rather than memory space), PCI_BASE_ADDRESS_IO_MASK is used instead of PCI_BASE_ADDRESS_MEM_MASK.
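
To see what the two masks do, here is a user-space sketch that decodes a raw base address register value. The MODEL_ constants mirror the PCI layout, in which bit 0 set marks an I/O region, the low four bits of a memory BAR are flags, and the low two bits of an I/O BAR are flags. The names, the struct and the sample values used to exercise it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's BAR constants, renamed for user space. */
#define MODEL_BAR_SPACE_IO 0x01u     /* bit 0 set: I/O space */
#define MODEL_BAR_MEM_MASK (~0x0fu)  /* strips the memory flag bits */
#define MODEL_BAR_IO_MASK  (~0x03u)  /* strips the I/O flag bits */

/* Decoded view of one base address register. */
struct model_region {
    bool is_io;
    uint32_t base;
};

/* Split a raw BAR value into its address space type and base address. */
void model_decode_bar(uint32_t bar, struct model_region *out)
{
    assert(out != NULL);
    out->is_io = (bar & MODEL_BAR_SPACE_IO) != 0;
    if (out->is_io)
        out->base = bar & MODEL_BAR_IO_MASK;
    else
        out->base = bar & MODEL_BAR_MEM_MASK;
}
```

For instance, a raw value of 0xe0000008 decodes to a memory region at 0xe0000000, while 0x0000e001 decodes to an I/O region at 0xe000.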

The interrupt line for the card is found in pdev->irq. Each card has only one interrupt but this may
be shared between devices on the card and between cards. The interrupt will have
been assigned for you at boot time, either by the BIOS or boot ROM, or by the
kernel itself.

PCI devices which are capable of generating bus read/write requests
themselves (say, to access host memory or another PCI device) are called “bus
masters”. If a card is bus-mastering, it is up to the driver to set the bus
master flag in the PCI configuration register of the board. This is such a common
operation that the function pci_set_master(pdev) is
provided to do this.

From the above information you can find and map both memory and I/O spaces on
a PCI card. If you’ve ever read a PCI card manual or looked at Linux 2.0 PCI code
you will see there is a third PCI address space — the “configuration space”. It
contains a mix of vendor-specific and standard registers that can be read and
sometimes written.

The registers holding the vendor and device ID, which are used to find your
card on the bus, are examples of configuration space registers. Another example is
the BAR registers themselves (which in turn point to memory or I/O space).


Figure 8:

error = pci_read_config_byte(struct pci_dev *, u8 where, u8 *val);
error = pci_read_config_word(struct pci_dev *, u8 where, u16 *val);
error = pci_read_config_dword(struct pci_dev *, u8 where, u32 *val);

error = pci_write_config_byte(struct pci_dev *, u8 where, u8 val);
error = pci_write_config_word(struct pci_dev *, u8 where, u16 val);
error = pci_write_config_dword(struct pci_dev *, u8 where, u32 val);

Linux 2.2 provides functions to read and write byte, word (16-bit) and dword
(32-bit) values in the PCI configuration space. These have a straightforward
mapping to the Linux 2.0 PCI BIOS functions. They can be seen in Figure 8.

Here, where is the address (from 0-255) in the
configuration space to access, and val is the value to
read or write. In Linux 2.0, these functions looked like:

error = pcibios_read_config_word(u8 bus, u8 devfn, u8 where, u16 *value);

and so forth. This makes porting PCI configuration handling from Linux 2.0 to
2.2 fairly painless.
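
The shape of these accessors can be modelled in user space with a 256-byte array standing in for one device's configuration space. This is a sketch under stated assumptions: the model_ names are hypothetical, the -1 returned for a misaligned offset is a choice made for the example, and only the little-endian layout and the vendor and device ID offsets (0x00 and 0x02) come from the PCI standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CFG_SIZE      256
#define MODEL_CFG_VENDOR_ID 0x00  /* standard offset of the vendor ID */
#define MODEL_CFG_DEVICE_ID 0x02  /* standard offset of the device ID */

/* Hypothetical stand-in for one device's configuration space. */
struct model_cfg {
    uint8_t bytes[MODEL_CFG_SIZE];
};

/* Read one byte; `where` is 0-255, so it is always in range. */
int model_read_config_byte(const struct model_cfg *c, uint8_t where, uint8_t *val)
{
    assert(c != NULL && val != NULL);
    *val = c->bytes[where];
    return 0;
}

/* Read a 16-bit little-endian value; the offset must be even. */
int model_read_config_word(const struct model_cfg *c, uint8_t where, uint16_t *val)
{
    assert(c != NULL && val != NULL);
    if (where & 1)
        return -1;  /* misaligned: error value chosen for this sketch */
    *val = (uint16_t)(c->bytes[where] | (c->bytes[where + 1] << 8));
    return 0;
}

/* Write a 16-bit little-endian value; the offset must be even. */
int model_write_config_word(struct model_cfg *c, uint8_t where, uint16_t val)
{
    assert(c != NULL);
    if (where & 1)
        return -1;  /* misaligned: error value chosen for this sketch */
    c->bytes[where] = (uint8_t)(val & 0xff);
    c->bytes[where + 1] = (uint8_t)(val >> 8);
    return 0;
}
```

As in the functions described above, the result comes back through a pointer and the return value carries only the error status.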

You may notice that in Linux 2.0, some drivers used the PCI BIOS functions
directly in ways that you now want to avoid. Directly accessing the PCI
configuration space (e.g., for reading IRQs, BAR registers, and setting the bus
master flag) was required in version 2.0.

In Linux 2.2, however, it may be the case that the configuration space values
don’t match those found in the struct pci_dev structure.
This is because the kernel knows about things such as interrupt re-mapping on
non-x86 hardware and has been quietly fiddling with these values behind your
back. In short, you should always use the values in struct pci_dev, and not
probe the PCI configuration space directly if you can help it.

Alan Cox is a well-known Linux hacker. He is currently working on the
development of drivers, security auditing, Linux/SGI porting, and modular sound.
He can be reached at alan@lxorguk.ukuu.org.uk.
