Spurred by data-hungry applications and legislation mandating long periods of data retention, IT departments are doubling storage requirements each year. To keep up with the demand and to simplify the management of such enormous volumes of information, many organizations have adopted storage area networks (SANs) based on Fibre Channel, which consolidate disparate hardware into a shared storage pool. And while Fibre Channel is complicated and expensive, its speed has historically surpassed other networking solutions, justifying its higher price tag.
But as the SAN market grew using 2 Gbps Fibre Channel, Ethernet attained higher and higher speeds. Today, off-the-shelf servers are equipped with multiple 1 Gbps Ethernet interfaces, and 10 Gbps network interface cards are available. Due to the high volumes at which Ethernet is manufactured, hardware such as multi-port Gigabit Ethernet switches costs hundreds of dollars, as opposed to thousands of dollars for Fibre Channel switches. Ethernet is now fast enough for the most demanding networked storage applications at a very reasonable cost.
Over the past decade, low-cost hard disks have also undergone dramatic improvements. ATA/IDE disks are now available with storage capacities of 400 GB and mean time between failure (MTBF) ratings of 1 million hours, all at rock-bottom prices.
So, what can one build from low-cost networking gear, high-capacity hard drives, and a little bit of software? ATA over Ethernet (AoE) — a SAN at a fraction of the cost of Fibre Channel.
ATA over Ethernet
ATA over Ethernet (AoE) is an open standards-based protocol designed to efficiently transfer ATA disk commands over Ethernet. Unlike iSCSI, AoE is a thin-layer protocol trafficked atop Ethernet. (See Figure One for a comparison.) AoE encapsulates standard ATA disk commands directly into Ethernet frames, providing a low latency, low overhead protocol that connects servers to block level storage over a standard Ethernet connection.
Since AoE doesn’t require processing of a TCP/IP stack, no TCP offload engine (TOE) is required to achieve good performance, and no special Host Bus Adapter is required to interface to the storage network. Also, AoE is easily decoded and it’s not a routable protocol.
AoE handles packet retransmission, and every packet sent is positively acknowledged. On modern Ethernet switches that support flow control, AoE can run without any retransmissions at all, yet the protocol still works in high packet loss environments where retransmission may be required.
The AoE protocol includes user customization features and provides a means for target device discovery and storage device physical location parameters. These features are especially important in large storage arrays consisting of thousands of storage devices.
Following Protocol
AoE is a tagged, client command/server response protocol. Each command and response is contained in a single Ethernet frame. AoE does not use IP; instead, network addressing is done using MAC addresses. (AoE has an IEEE registered Ethernet type of 0x88A2.)
To mitigate the complexity of using MAC addresses to manage storage devices, AoE uses an aoemajor/aoeminor abstraction in addition to the Ethernet address. This simple approach allows the system administrator to manage AoE devices based on an ordered hierarchy instead of their MAC addresses.
Figure Two shows the header of all AoE packets. In the figure, the horizontal axis labels bits and the vertical axis labels bytes. Bytes 0 through 13 are the standard Ethernet header. (For an explanation of the Ver, Flags, and Error fields, please see the AoE specification at http://www.coraid.com/documents/AoEr8.txt.) The contents of the Arg field vary based on the command specified in the Command field.
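As a rough illustration of the header described above, here is a minimal Python sketch that packs and unpacks the fixed part of an AoE frame. The field order and widths (a 4-bit version sharing a byte with 4 flag bits, then error, major, minor, command, and a 32-bit tag) reflect one reading of the public AoE specification and are assumptions to check against that document. The function names and the example MAC address are hypothetical.

```python
import struct

AOE_ETHERTYPE = 0x88A2   # IEEE registered Ethernet type for AoE
AOE_VERSION = 1          # assumed protocol version
FLAG_RESPONSE = 0x8      # assumed "response" flag bit
FLAG_ERROR = 0x4         # assumed "error" flag bit

# dst MAC, src MAC, Ethernet type (bytes 0-13), then the AoE fields:
# ver/flags, error, major, minor, command, tag. Network byte order.
HEADER = struct.Struct("!6s6sHBBHBBI")


def pack_header(dst, src, major, minor, command, tag, flags=0, error=0):
    """Build the Ethernet + AoE header; the Arg bytes would follow it."""
    ver_flags = (AOE_VERSION << 4) | (flags & 0x0F)
    return HEADER.pack(dst, src, AOE_ETHERTYPE, ver_flags, error,
                       major, minor, command, tag)


def unpack_header(frame):
    """Decode the header fields from the start of a received frame."""
    (dst, src, ethertype, ver_flags, error,
     major, minor, command, tag) = HEADER.unpack_from(frame)
    if ethertype != AOE_ETHERTYPE:
        raise ValueError("not an AoE frame")
    return {
        "dst": dst, "src": src,
        "version": ver_flags >> 4, "flags": ver_flags & 0x0F,
        "error": error, "major": major, "minor": minor,
        "command": command, "tag": tag,
    }


if __name__ == "__main__":
    src = bytes.fromhex("021122334455")      # made-up source MAC
    frame = pack_header(b"\xff" * 6, src, major=5, minor=3,
                        command=0, tag=0x1234)
    print(unpack_header(frame))
```

Because the client chooses the tag and the server echoes it back, a response can be matched to its command without any connection state.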
The AoE protocol currently supports two commands. Conceptually, Command 0 is “Issue ATA Command,” and Command 1 is “Query Config Information.”
Command 0 is used to issue an ATA command to the AoE server’s attached ATA device. Arg includes the ATA device registers used for specifying the command (for example, read or write) and the associated data. Since AoE packets are not permitted to span Ethernet frames, the maximum ATA transaction size using AoE is 1 KB. This does not adversely affect Linux filesystem block size, though, as the AoE driver breaks large system requests into multiple 1 KB AoE requests.
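To make the splitting step concrete, the sketch below breaks one large request into frame-sized pieces. It assumes 512-byte sectors, so that 1 KB corresponds to two sectors per AoE request. The function name is hypothetical and this is not the driver's actual code.

```python
SECTOR_SIZE = 512                 # bytes per ATA sector (assumed)
MAX_PAYLOAD = 1024                # 1 KB limit per AoE request
SECTORS_PER_REQUEST = MAX_PAYLOAD // SECTOR_SIZE


def split_request(lba, sector_count, per_request=SECTORS_PER_REQUEST):
    """Split (starting LBA, sector count) into (LBA, count) chunks
    that each fit in a single AoE request."""
    if sector_count < 0:
        raise ValueError("sector_count must be non-negative")
    chunks = []
    while sector_count > 0:
        n = min(per_request, sector_count)
        chunks.append((lba, n))
        lba += n
        sector_count -= n
    return chunks


if __name__ == "__main__":
    # A 4 KB filesystem block is eight sectors, so four AoE requests.
    for chunk in split_request(lba=100, sector_count=8):
        print(chunk)
```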
Command 1 is used to manage and discover AoE devices. The primary client method for device discovery is to broadcast a Query Config command so that all AoE devices on the network can respond and be enumerated. Command 1 can also be used to store up to 1 KB of information on the AoE device (not on the ATA disk) for more fine-grained access control.
And that’s it. Instead of issuing ATA commands to a locally attached disk, commands are wrapped up in an Ethernet frame and sent over the network to a remote server.
The System Side
The AoE driver in the Linux kernel discovers AoE devices on the network, registers them with the system, translates system read/write requests into AoE ATA read/write requests, retransmits AoE commands when responses do not return in time, and fails devices to the system when the devices stop responding.
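The retransmit-and-fail behavior can be sketched with a small table of outstanding commands keyed by tag. The fixed timeout, the retry limit, and the class name below are illustrative assumptions, not the policy the kernel driver actually uses.

```python
class PendingCommands:
    """Track unanswered commands by tag; resend on timeout, then give up."""

    def __init__(self, timeout=0.5, max_retries=3):
        self.timeout = timeout          # seconds to wait for a response
        self.max_retries = max_retries  # resends allowed before failing
        self.next_tag = 0
        self.pending = {}               # tag -> [frame, deadline, retries]

    def send(self, frame, now):
        """Record a newly sent frame and return the tag assigned to it."""
        tag = self.next_tag
        self.next_tag = (self.next_tag + 1) & 0xFFFFFFFF
        self.pending[tag] = [frame, now + self.timeout, 0]
        return tag

    def receive(self, tag):
        """Mark a command as answered; False if the tag is unknown."""
        return self.pending.pop(tag, None) is not None

    def tick(self, now):
        """Return (frames_to_resend, device_failed) for the current time."""
        resend = []
        for tag, entry in self.pending.items():
            frame, deadline, retries = entry
            if now < deadline:
                continue
            if retries >= self.max_retries:
                return [], True         # device stopped responding
            entry[1] = now + self.timeout
            entry[2] = retries + 1
            resend.append((tag, frame))
        return resend, False


if __name__ == "__main__":
    cmds = PendingCommands(timeout=1.0, max_retries=2)
    tag = cmds.send(b"read", now=0.0)
    print(cmds.tick(now=1.0))   # the frame is due for a resend
    print(cmds.receive(tag))    # a response arrives; the entry is cleared
```

Passing the current time in explicitly keeps the sketch deterministic; a real driver would hook the same logic to a timer.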
Outside of the Linux kernel, the root user can interact with the driver to list known devices, trigger a discover beacon (broadcast a Query Config command), or restrict the set of interfaces valid for AoE traffic. By default, all interfaces are valid for traffic.
Figure Three shows a set of commands to load the aoe module, send a discover beacon, and list the devices discovered. The commands aoe-discover and aoe-stat are shell scripts distributed with the current 2.6 AoE driver. The system devices are named eX.Y, where X and Y are the aoemajor and aoeminor that the AoE device declares.
As Figure Three shows, running aoe-discover yields a number of AoE devices, which are subsequently accessible from their appropriately named device nodes in /dev/etherd/. Although the disks are remote, they behave just as if they were locally attached.
The EtherDrive Blade
Coraid’s EtherDrive Storage Blade is one of the first products to use the AoE protocol to provide networked storage. With EtherDrive Storage Blades, a complete storage system can be assembled for less than $2/GByte, including ATA disks and the Ethernet networking fabric. To compare, the industry norm for SAN systems using Fibre Channel is approximately $20/GByte.
Each Coraid EtherDrive storage blade is a nanoserver with a CPU, RAM, and interfaces for Ethernet and ATA. Each blade acts as an AoE device server. The blade slides into one of ten slots in a 3U high rack-mountable shelf. (See Figure Four.) The EtherDrive shelf has a physically settable address via a rear dipswitch and can have any value between 0 and 4095. As each shelf accessed from a host must have a unique address, this gives a potential store of 40,950 disks. Each EtherDrive Storage Blade’s data transfer throughput and IOPS are completely independent from the other blades. By using RAID to concurrently access multiple EtherDrive blades as a single data store, a server can aggregate the throughput of many blades.
When the blade is inserted into the shelf, it queries the shelf for two numbers: the shelf address and the slot the blade is operating in. The blade uses this shelf/slot address as its aoemajor/aoeminor. By doing this, the aoemajor/aoeminor conceptual addressing of network disks is simplified to a physical location. The system administrator can think of device node /dev/etherd/e0.1 as the disk in shelf 0, slot 1.
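The mapping between a device node and a physical location is simple enough to express in a few lines. The helper names below are hypothetical; the only facts taken from the text are the /dev/etherd/ directory and the e followed by shelf.slot naming.

```python
import re

# Matches "e<shelf>.<slot>", with or without the /dev/etherd/ prefix.
_NODE = re.compile(r"^(?:/dev/etherd/)?e(\d+)\.(\d+)$")


def parse_node(path):
    """Return (shelf, slot) for a device node such as /dev/etherd/e0.1."""
    match = _NODE.match(path)
    if match is None:
        raise ValueError(f"not an AoE device node: {path!r}")
    return int(match.group(1)), int(match.group(2))


def node_for(shelf, slot):
    """Return the device node path for a given shelf and slot."""
    return f"/dev/etherd/e{shelf}.{slot}"


if __name__ == "__main__":
    print(parse_node("/dev/etherd/e0.1"))
    print(node_for(5, 3))
```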
AoE Storage Example
Two Linux tools, mdadm and LVM, make it very easy to use EtherDrive storage blades for redundant, expandable storage. Mdadm is a tool for the administration of Linux kernel md (RAID) devices; LVM is a logical volume manager for creating abstractions on top of block device storage.
Let’s create a base installation using one shelf and expand it with another. The example uses RAID-0 stripes over each shelf for simplicity; in a production environment you’d likely want to choose a RAID level that offers redundancy (RAID-1+0 or RAID-5). This example uses the 2.6-5 AoE driver running on a 2.6.10 kernel; the LVM used in the 2.6 kernel series is LVM2.
Let’s assume that two projects need storage. Using LVM, let’s allocate some of shelf 5 to each, then expand project 1 using shelf 7. Each blade in shelves 5 and 7 has a 40 GB disk. Figure Five shows the commands to run.
First, use mdadm to create a RAID-0 stripe over shelf 5. Then use pvcreate to mark /dev/md0 as an LVM “physical volume.” The subsequent vgcreate command creates an LVM volume group called datastore, initially containing /dev/md0. Run vgdisplay to display the volume groups.
lvcreate creates logical volumes from the volume group, datastore. lvdisplay shows the logical volumes. Next, make filesystems on your logical volumes and mount them just for show.
The next mdadm and pvcreate commands should look familiar: you used them to initialize shelf 5 for use in the logical volumes. This time they prepare shelf 7 as /dev/md1.
To add /dev/md1 to the existing volume group, datastore, use vgextend. Next, run lvextend to grow the logical volume for project_1. At this point, the filesystem is not aware of the additional space available to it; since the filesystem is ext3, use the resize2fs command to grow the filesystem.
Finally, remount the filesystem to see that it has indeed expanded as expected.
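For readers without the figure at hand, the walkthrough can be approximated by the dry-run sketch below, which only prints the command sequence and executes nothing. It is not a copy of Figure Five. The logical volume sizes, the project_2 name, the slot numbering 0 through 9, and the exact mdadm and LVM options shown are assumptions chosen for illustration. Mount and unmount steps are left out, and the filesystem is assumed to be unmounted while resize2fs runs.

```python
def stripe_and_pv(md_device, shelf, slots=10):
    """Commands to stripe one shelf with RAID-0 and mark it as an LVM PV."""
    blades = [f"/dev/etherd/e{shelf}.{slot}" for slot in range(slots)]
    return [
        ["mdadm", "--create", md_device, "--level=0",
         f"--raid-devices={slots}", *blades],
        ["pvcreate", md_device],
    ]


def build_plan(vg="datastore"):
    """Return the ordered command list for the two-shelf example."""
    plan = []
    # Base installation on shelf 5.
    plan += stripe_and_pv("/dev/md0", shelf=5)
    plan.append(["vgcreate", vg, "/dev/md0"])
    for name in ("project_1", "project_2"):      # sizes are hypothetical
        plan.append(["lvcreate", "-L", "100G", "-n", name, vg])
        plan.append(["mkfs.ext3", f"/dev/{vg}/{name}"])
    # Expansion with shelf 7, then growing project_1.
    plan += stripe_and_pv("/dev/md1", shelf=7)
    plan.append(["vgextend", vg, "/dev/md1"])
    plan.append(["lvextend", "-L", "+200G", f"/dev/{vg}/project_1"])
    plan.append(["resize2fs", f"/dev/{vg}/project_1"])
    return plan


if __name__ == "__main__":
    for command in build_plan():
        print(" ".join(command))
```

Keeping the plan as data makes it easy to review before running anything as root, and adding another shelf is one more stripe_and_pv call followed by vgextend.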
This example displays how easy storage is to manage and incrementally expand when based on three production-ready open source technologies: AoE, mdadm, and LVM.
Conclusion
Due to the advent of 1 Gbps and 10 Gbps Ethernet and its pervasive deployment, Ethernet-based SANs can be created for a fraction of the cost of their Fibre Channel counterparts. Using AoE, standard ATA disks, and the Coraid EtherDrive storage blade, a dynamically expandable Ethernet SAN can be created for less than $2/GByte. And since AoE is not based on TCP, it performs well without the need for costly TOE boards.
As a final note, Coraid’s AoE driver for the 2.6 Linux kernel has been accepted into the mainline distribution and is available in Linux kernels 2.6.11 and up.
For other kernels, you can download a standalone loadable aoe module from the Coraid web site at www.coraid.com. The site also has useful HOWTOs and examples.
Sam Hopkins is a programmer for Coraid, Inc. He is the original author of the Linux and FreeBSD AoE device drivers and a co-author of the AoE protocol. He can be reached at sah@coraid.com.