The ATA over Ethernet (AoE) Protocol

ATA over Ethernet connects ATA disks to remote hosts via Ethernet, providing sites with a low-cost storage area network built from commodity components.
Spurred by data-hungry applications and legislation mandating long periods of data retention, IT departments are doubling their storage requirements each year. To keep up with demand and to simplify the management of such enormous volumes of information, many organizations have adopted storage area networks (SANs) based on Fibre Channel, which consolidate disparate hardware into a shared storage pool. And while Fibre Channel is complicated and expensive, its speed has historically surpassed that of other networking solutions, justifying its higher price tag.
But as the SAN market grew using 2 Gbps Fibre Channel, Ethernet attained higher and higher speeds. Today, off-the-shelf servers are equipped with multiple 1 Gbps Ethernet interfaces, and 10 Gbps network interface cards are available. Because Ethernet is manufactured in such high volumes, hardware such as multi-port Gigabit Ethernet switches costs hundreds of dollars, as opposed to thousands of dollars for Fibre Channel switches. Ethernet is now fast enough for the most demanding networked storage applications at a very reasonable cost.
Over the past decade, low-cost hard disks have also undergone dramatic improvements. ATA/IDE disks are now available with storage capacities of 400 GB and mean time between failure (MTBF) ratings of 1 million hours, all at rock-bottom prices.
So, what can one build from low-cost networking gear, high-capacity hard drives, and a little bit of software? ATA over Ethernet (AoE) — a SAN at a fraction of the cost of Fibre Channel.

ATA over Ethernet

ATA over Ethernet (AoE) is an open standards-based protocol designed to efficiently transfer ATA disk commands over Ethernet. Unlike iSCSI, AoE is a thin-layer protocol trafficked atop Ethernet. (See Figure One for a comparison.) AoE encapsulates standard ATA disk commands directly into Ethernet frames, providing a low latency, low overhead protocol that connects servers to block level storage over a standard Ethernet connection.
FIGURE ONE: ATA over Ethernet is a thin-layer protocol, unlike iSCSI



Since AoE doesn't require processing of a TCP/IP stack, no TCP offload engine (TOE) is needed to achieve good performance, and no special host bus adapter is required to interface with the storage network. AoE is also easy to decode, and because it is not a routable protocol, AoE traffic stays confined to the local Ethernet network.
AoE handles packet retransmission, and each packet sent receives a positive acknowledgement. With modern Ethernet switches that support flow control, zero AoE packet retransmissions can be achieved, yet the protocol still works in high packet-loss environments where retransmission may be required.
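If your network adapter and switch both support it, 802.3x flow control can usually be enabled from Linux with ethtool. The following is a minimal sketch, assuming an interface named eth1 carries the AoE traffic; actual pause-frame support depends on your driver and switch.

# enable sending and honoring Ethernet pause frames on the AoE interface
# (eth1 is an assumed interface name; substitute your storage-facing NIC)
ethtool -A eth1 rx on tx on

# confirm the current pause-frame settings
ethtool -a eth1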
The AoE protocol includes user customization features and provides a means for target device discovery and for reporting a storage device's physical location. These features are especially important in large storage arrays consisting of thousands of storage devices.

Following Protocol

AoE is a tagged, client command/server response protocol. Each command and response is contained in a single Ethernet frame. AoE does not use IP; instead, network addressing is done using MAC addresses. (AoE has an IEEE registered Ethernet type of 0x88a2.)
To mitigate the complexity of managing storage devices by MAC address, AoE uses an aoemajor/aoeminor abstraction in addition to the Ethernet address. This simple approach allows the system administrator to manage AoE devices based on an ordered hierarchy instead of their MAC addresses.
Figure Two shows the header of all AoE packets. In the figure, the horizontal axis labels bits and the vertical axis labels bytes. Bytes 0 through 13 are the standard Ethernet header. (For an explanation of the Ver, Flags, and Error fields, please see the AoE specification at http://www.coraid.com/documents/AoEr8.txt.) The contents of the Arg field vary based on the command specified in the Command field.
FIGURE TWO: The ATA over Ethernet header
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0 |                   Ethernet Destination Addr                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 4 |  Ethernet Destination Addr    |     Ethernet Source Addr      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 8 |                     Ethernet Source Addr                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
12 |    Ethernet Type (0x88A2)     |  Ver  | Flags |     Error     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |             Major             |     Minor     |    Command    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
20 |                              Tag                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
24 |                Arg (n = |Arg|: 0 <= n <= 1476)                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The AoE protocol currently supports two commands: Command 0 is "Issue ATA Command," and Command 1 is "Query Config Information."
Command 0 is used to issue an ATA command to the AoE server’s attached ATA device. Arg includes the ATA device registers used for specifying the command (for example, read or write) and the associated data. Since AoE packets are not permitted to span Ethernet frames, the maximum ATA transaction size using AoE is 1 KB. This does not adversely affect Linux filesystem block size, though, as the AoE driver breaks large system requests into multiple 1 KB AoE requests.
Command 1 is used to manage and discover AoE devices. The primary client method for device discovery is to broadcast a Query Config command so that all AoE devices on the network can respond and be enumerated. Command 1 can also be used to store up to 1 KB of information on the AoE device (not on the ATA disk) for more fine-grained access control.
And that’s it. Instead of issuing ATA commands to a locally attached disk, commands are wrapped up in an Ethernet frame and sent over the network to a remote server.

The System Side

The AoE driver in the Linux kernel discovers AoE devices on the network, registers them with the system, translates system read/write requests into AoE ATA read/write requests, retransmits AoE commands when responses do not return in time, and marks devices as failed when they stop responding.
Outside of the Linux kernel, the root user can interact with the driver to list known devices, trigger a discover beacon (broadcast a Query Config command), or restrict the set of interfaces valid for AoE traffic. By default, all interfaces are valid for traffic.
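For example, to confine AoE to a single interface, the valid-interface list can typically be set when the module is loaded. Here is a minimal sketch, assuming the aoe_iflist module parameter of the 2.6 aoe driver and an interface named eth1:

# allow AoE traffic only on eth1; multiple interfaces may be listed,
# separated by spaces (aoe_iflist and eth1 are assumptions for this sketch)
modprobe aoe aoe_iflist="eth1"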
Figure Three shows a set of commands to load the aoe module, send a discover beacon, and list the devices discovered. The commands aoe-discover and aoe-stat are shell scripts distributed with the current 2.6 AoE driver. The system devices are named e<aoemajor>.<aoeminor> (for example, e0.1), corresponding to the aoemajor/aoeminor pair that each AoE device declares.
FIGURE THREE: Linux commands to discover ATA over Ethernet devices
% modprobe aoe
% aoe-discover
% aoe-stat
e0.0 eth1 up
e0.1 eth1 up
e0.2 eth1 up
e0.3 eth1 up
e0.4 eth1 up
e0.5 eth1 up
e0.6 eth1 up
e0.7 eth1 up
e0.8 eth1 up
e0.9 eth1 up
% ls -l /dev/etherd/e0.[0-9]
brw-rw---- 1 root root 152, 20 Feb 8 11:15 /dev/etherd/e0.0
brw-rw---- 1 root root 152, 21 Feb 8 11:15 /dev/etherd/e0.1
brw-rw---- 1 root root 152, 22 Feb 8 11:15 /dev/etherd/e0.2
brw-rw---- 1 root root 152, 23 Feb 8 11:15 /dev/etherd/e0.3
brw-rw---- 1 root root 152, 24 Feb 8 11:15 /dev/etherd/e0.4
brw-rw---- 1 root root 152, 25 Feb 8 11:15 /dev/etherd/e0.5
brw-rw---- 1 root root 152, 26 Feb 8 11:15 /dev/etherd/e0.6
brw-rw---- 1 root root 152, 27 Feb 8 11:15 /dev/etherd/e0.7
brw-rw---- 1 root root 152, 28 Feb 8 11:15 /dev/etherd/e0.8
brw-rw---- 1 root root 152, 29 Feb 8 11:15 /dev/etherd/e0.9
As Figure Three shows, running aoe-discover yields a number of AoE devices, which are subsequently accessible from their appropriately named device nodes in /dev/etherd/. Although the disks are remote, they behave just as if they were locally attached.
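For instance, a single AoE device can be put to work exactly as a local drive would be. A minimal sketch (the mount point and the choice of ext3 are just for illustration):

# put a filesystem on the disk in shelf 0, slot 2 and mount it
mkfs.ext3 /dev/etherd/e0.2
mkdir -p /mnt/shelf0.slot2
mount /dev/etherd/e0.2 /mnt/shelf0.slot2

# the remote disk is now just another mounted filesystem
df -h /mnt/shelf0.slot2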

The EtherDrive Blade

Coraid’s EtherDrive Storage Blade is one of the first products to use the AoE protocol to provide networked storage. With EtherDrive Storage Blades, a complete storage system can be assembled for less than $2/GByte, including ATA disks and the Ethernet networking fabric. To compare, the industry norm for SAN systems using Fibre Channel is approximately $20/GByte.
Each Coraid EtherDrive Storage Blade is a nanoserver with a CPU, RAM, and interfaces for Ethernet and ATA. Each blade acts as an AoE device server. The blade slides into one of ten slots in a 3U-high, rack-mountable shelf. (See Figure Four.) Each EtherDrive shelf has a physically settable address, configured via a rear DIP switch, that can take any value between 0 and 4095. Since each shelf accessed from a host must have a unique address, this gives a potential store of 40,950 disks. Each EtherDrive Storage Blade's data-transfer throughput and IOPS are completely independent of the other blades. By using RAID to access multiple EtherDrive blades concurrently as a single data store, a server can scale performance to meet its needs.
FIGURE FOUR: The Coraid EtherDrive ATA over Ethernet Storage Blade



When a blade is inserted into the shelf, it queries the shelf for two numbers: the shelf address and the slot in which the blade sits. The blade uses this shelf/slot address as its aoemajor/aoeminor. In this way, the conceptual aoemajor/aoeminor addressing of network disks maps directly to a physical location: the system administrator can think of device node /dev/etherd/e0.1 as the disk in shelf 0, slot 1.

AoE Storage Example

Two Linux tools, mdadm and LVM, make it very easy to use EtherDrive Storage Blades for redundant, expandable storage. mdadm is a tool for administering Linux kernel md (RAID) devices; LVM is a logical volume manager for creating flexible abstractions on top of block storage.
Let's create a base installation using one shelf and then expand it with another. The example uses a RAID-0 stripe over each shelf for simplicity; in a production environment, you'd likely choose a RAID level that offers redundancy, such as RAID 1+0 or RAID 5 (a sketch of the latter follows below). This example uses the 2.6-5 AoE driver running on a 2.6.10 kernel; the LVM used in the 2.6 kernel series is LVM2.
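For reference, a redundant alternative to the RAID-0 stripe built in Figure Five might look like the following sketch: RAID 5 across nine of shelf 5's ten blades, keeping the tenth as a hot spare. (The capacities reported later in Figure Five assume the RAID-0 layout, so treat this only as an illustration.)

# RAID-5 over nine blades with one hot spare, instead of a plain stripe
mdadm -C /dev/md0 -l 5 -n 9 -x 1 /dev/etherd/e5.[0-9]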
Let’s assume that two projects need storage. Using LVM, let’s allocate some of shelf 5 to each, then expand project 1 using shelf 7. Each blade in shelves 5 and 7 has a 40 GB disk. Figure Five shows the commands to run.
First, use mdadm to create a RAID-0 stripe over shelf 5. Then use pvcreate to mark /dev/md0 as an LVM "physical volume." The subsequent vgcreate command creates an LVM volume group called datastore, initially containing /dev/md0. Run vgdisplay to display the volume groups.
lvcreate creates logical volumes from the datastore volume group, and lvdisplay shows the resulting logical volumes. Next, make filesystems on the logical volumes and mount them, just for show.
The next mdadm and pvcreate commands should look familiar: you used them earlier to initialize shelf 5 for use in the logical volumes.
To add /dev/md1 to the existing volume group, datastore, use vgextend. Next, run lvextend to grow the logical volume for project_1. At this point, the filesystem is not yet aware of the additional space available to it; since the filesystem is ext3, use the resize2fs command to grow it.
Finally, remount the filesystem to see that it has indeed expanded as expected.
FIGURE FIVE: Connecting to an array of drives over ATA over Ethernet
# aoe-stat
e5.0 eth1 up
e5.1 eth1 up
e5.2 eth1 up
e5.3 eth1 up
e5.4 eth1 up
e5.5 eth1 up
e5.6 eth1 up
e5.7 eth1 up
e5.8 eth1 up
e5.9 eth1 up
e7.0 eth3 up
e7.1 eth3 up
e7.2 eth3 up
e7.3 eth3 up
e7.4 eth3 up
e7.5 eth3 up
e7.6 eth3 up
e7.7 eth3 up
e7.8 eth3 up
e7.9 eth3 up
# mdadm -C /dev/md0 -l 0 -n 10 /dev/etherd/e5.[0-9]
VERS = 9000
mdadm: array /dev/md0 started.
# pvcreate /dev/md0
Physical volume "/dev/md0" successfully created
# vgcreate datastore /dev/md0
Volume group "datastore" successfully created
# vgdisplay
--- Volume group ---
VG Name datastore
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 383.46 GB
PE Size 4.00 MB
Total PE 98166
Alloc PE / Size 0 / 0
Free PE / Size 98166 / 383.46 GB
VG UUID uWYiR6-KIBe-4ws5-MZEL-ic4S-dLlk-zzSQ2X

# lvcreate -L 100G -n project_1 datastore
Logical volume "project_1" created
# lvcreate -L 283G -n project_2 datastore
Logical volume "project_2" created
# lvdisplay
--- Logical volume ---
LV Name /dev/datastore/project_1
VG Name datastore
LV UUID 8grMnM-5u5N-yP0o-giQ0-zJon-y9Qo-9FNxbj
LV Write Access read/write
LV Status available
# open 1
LV Size 100.00 GB
Current LE 25600
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 254:0

--- Logical volume ---
LV Name /dev/datastore/project_2
VG Name datastore
LV UUID aqvycu-g1Wj-gqqv-qTD3-76ez-wKFd-t9s50z
LV Write Access read/write
LV Status available
# open 1
LV Size 283.00 GB
Current LE 72448
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 254:1

# ls -l /dev/datastore/project_[12]
lrwxrwxrwx 1 root root 31 Feb 15 15:08 /dev/datastore/project_1 ->
/dev/mapper/datastore-project_1
lrwxrwxrwx 1 root root 31 Feb 15 15:08 /dev/datastore/project_2 ->
/dev/mapper/datastore-project_2

# for i in /dev/datastore/project_[12]; do mkfs.ext3 $i & done
...
# mkdir -p /mnt/project_1 /mnt/project_2
# for i in project_1 project_2; do mount /dev/datastore/$i /mnt/$i; done
# df /mnt/project_[12]
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/datastore-project_1
103212320 32828 97936612 1% /mnt/project_1
/dev/mapper/datastore-project_2
292091008 32828 277220832 1% /mnt/project_2

# mdadm -C /dev/md1 -l 0 -n 10 /dev/etherd/e7.[0-9]
VERS = 9000
mdadm: array /dev/md1 started.

# pvcreate /dev/md1
Physical volume "/dev/md1" successfully created

# vgextend datastore /dev/md1
Volume group "datastore" successfully extended

# vgdisplay
--- Volume group ---
VG Name datastore
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 8
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 2
Open LV 2
Max PV 0
Cur PV 2
Act PV 2
VG Size 766.92 GB
PE Size 4.00 MB
Total PE 196332
Alloc PE / Size 98048 / 383.00 GB
Free PE / Size 98284 / 383.92 GB
VG UUID uWYiR6-KIBe-4ws5-MZEL-ic4S-dLlk-zzSQ2X

# lvextend -L 283G /dev/datastore/project_1
Extending logical volume project_1 to 283.00 GB
Logical volume project_1 successfully resized

# umount /dev/datastore/project_1
# resize2fs /dev/datastore/project_1
resize2fs 1.35 (28-Feb-2004)
Please run 'e2fsck -f /dev/datastore/project_1' first.

## we're using an ext2 tool on an ext3 filesystem, so let's do as it asks
# e2fsck -f /dev/datastore/project_1
e2fsck 1.35 (28-Feb-2004)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/datastore/project_1: 11/13107200 files (0.0% non-contiguous),
419527/26214400 blocks
# resize2fs /dev/datastore/project_1
resize2fs 1.35 (28-Feb-2004)
Resizing the filesystem on /dev/datastore/project_1 to 74186752 (4k)
blocks.
The filesystem on /dev/datastore/project_1 is now 74186752 blocks long.

# mount /dev/datastore/project_1 /mnt/project_1
# df /mnt/project_[12]
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/datastore-project_1
292091008 32828 277220832 1% /mnt/project_1
/dev/mapper/datastore-project_2
292091008 32828 277220832 1% /mnt/project_2
# vgdisplay
--- Volume group ---
VG Name datastore
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 10
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 2
Open LV 2
Max PV 0
Cur PV 2
Act PV 2
VG Size 766.92 GB
PE Size 4.00 MB
Total PE 196332
Alloc PE / Size 144896 / 566.00 GB
Free PE / Size 51436 / 200.92 GB
VG UUID uWYiR6-KIBe-4ws5-MZEL-ic4S-dLlk-zzSQ2X
This example demonstrates how easy it is to manage and incrementally expand storage built on three production-ready, open source technologies: AoE, mdadm, and LVM.

Conclusion

Due to the advent of 1 Gbps and 10 Gbps Ethernet and its pervasive deployment, Ethernet-based SANs can be created for a fraction of the cost of their Fibre Channel counterparts. Using AoE, standard ATA disks, and the Coraid EtherDrive Storage Blade, a dynamically expandable Ethernet SAN can be built for less than $2/GB. And since AoE is not based on TCP/IP, performance is excellent without the need for costly TOE boards.
As a final note, Coraid's AoE driver for the 2.6 Linux kernel has been accepted into the mainline kernel and is available in Linux kernels 2.6.11 and up.
For other kernels, you can download a standalone loadable aoe module from the Coraid web site at www.coraid.com. The site also has useful HOWTOs and examples.

Sam Hopkins is a programmer for Coraid, Inc. He is the original author of the Linux and FreeBSD AoE device drivers and a co-author of the AoE protocol. He can be reached at sah@coraid.com.
