Storage Area Networks never really realized the promise of providing a shared pool of storage. IBM's Storage Tank delivers on that promise - and more. Here's an exclusive, in-depth look at how Storage Tank works.
By now, most of you are probably familiar with the promise of storage area networks (SANs): put all your storage in a heap in the center of the room and all your servers around the edges. In this configuration, the storage heap looks like the physically-attached disks the servers are used to, but all of the servers see all of the same disks all the time. According to the ideal, a SAN is supposed to be cheaper and simpler, too. With a SAN, storage doesn’t have to be divided up among the servers, and shadowing and workload partitioning are unnecessary. And, of course, the SAN ideal provides the speed of conventional, locally-attached disks.
Indeed, sites are using SANs today to get these benefits. You can move storage space from one server to another, or from your reserve pool, without leaving your desk and without powering down. You can have a multitude of servers use a single copy of some data.
But the reality stops far short of what you might envision as a shared pool of storage. Though you can move storage around just by updating configuration files, you still have to manage the space.
You still have to figure out how many blocks of what kind of space each server needs, and you must constantly move it around to stay on top of changing business needs. You’re still required to keep track of what’s mounted where, must enlarge and reduce filesystems, must move files from one filesystem to another, and must reconfigure logical volume pools — to name just a few of the taxing tasks.
And while you can technically access the same data from multiple servers, you have to be very careful about how you do it. An operating system tends to assume that its disks are “private” and exploits that fact to do things like cache file data. The only way to exploit SAN storage to share data is to make sure the data never changes except via a tightly controlled process usually involving shutting down services.
What you really want from your central heap of storage is something where the blocks of storage go automatically to the server that needs them most at the moment. And you want your applications to be able to read and write files just like the old days — without worrying about what other servers on the SAN are doing.
Enter Storage Tank
Providing this simple, natural view of a SAN is what Storage Tank, a new technology from IBM, is all about. With Storage Tank, you turn all the disks in your SAN into a single, giant filesystem. You mount that filesystem on as many servers as you like, and read and write files as if the filesystem were on a locally attached disk.
There’s no such thing as allocating space to a server — when a server goes to expand a file, unused blocks in the single (heap-like) filesystem get allocated automatically, as with any other disk-based filesystem. And you still get regular Unix file caching; in fact, performance is equivalent to what you’d get if you used the SAN the old way (assign an individual disk to an individual server and build a private ext3 filesystem on it).
You don’t even have to partition your storage into Linux and Windows space. Storage Tank technology works on most modern operating systems, simultaneously. You can have files used by Windows servers and files used by Linux servers, all drawn from the same pool of blocks, and there’s nothing to stop you from using the same file on both Linux and Windows.
Figure One: In Storage Tank, files flow over two networks. File data flows over a SAN between client systems and disks, while metadata flows over a LAN between client systems and central Storage Tank metadata servers. Metadata further flows between the metadata servers and its ultimate disk home over the SAN. The SAN pictured might actually be many separate SANs.
Storage Tank technology is available today in one IBM product called IBM Total Storage SAN File System (SANFS). You can read the SANFS sidebar for details on this product, but this article is about the underlying Storage Tank technology, not any particular product.
As you’ll see, the first version of SANFS doesn’t implement all the capabilities of Storage Tank.
What Storage Tank Looks Like
Figure One shows how a Storage Tank system is organized. At its core, Storage Tank separates file data from metadata. File data — the actual contents of the files — lives on file data volumes. Metadata — constructs and attributes like directories, file permissions, modification times — lives on metadata volumes.
Aside from 8K of static volume identification overhead, a file data volume contains nothing but anonymous file data blocks.
No allocation bitmaps, no inodes, and no directories. Nothing on a file data volume hints at the existence of files, and it’s normal for a single file’s blocks to be scattered over multiple file data volumes.
The metadata volumes map which blocks on the file data volumes belong to which files and where free blocks can be found. The metadata volumes also contain all the directories, file naming information, and file attributes.
Systems that access files in a Storage Tank filesystem are called Storage Tank clients. In addition, there is a cluster of servers called Storage Tank metadata servers. Storage Tank clients access the file data volumes directly via the SAN, but never touch the metadata volumes.
Conversely, metadata servers access the metadata volumes but never touch the file data volumes. Storage Tank clients talk to the metadata server cluster over a TCP/IP network (LAN). Clients never talk directly to each other.
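The split between file data and metadata can be pictured with a small sketch. This is only an illustration under stated assumptions: the names `MetadataStore`, `client_write`, and `client_read` and the `(volume, block)` addressing are hypothetical, everything lives in memory, and the metadata server and its volumes are collapsed into one object. It is not the Storage Tank on-disk format or protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Addr = Tuple[str, int]  # (file data volume name, block number); hypothetical addressing


@dataclass
class MetadataStore:
    """Stand-in for the metadata side: free-block list plus per-file block maps."""
    free: List[Addr]
    files: Dict[str, List[Addr]] = field(default_factory=dict)

    def extend(self, path: str, nblocks: int) -> List[Addr]:
        """Take blocks from the shared pool and record them as belonging to `path`."""
        if nblocks > len(self.free):
            raise OSError("no free blocks left in the pool")
        grant, self.free = self.free[:nblocks], self.free[nblocks:]
        self.files.setdefault(path, []).extend(grant)
        return grant


def client_write(meta, volumes, path, chunks):
    """Ask the metadata side where to write, then write the blocks directly."""
    addrs = meta.extend(path, len(chunks))     # metadata traffic (the LAN in the article)
    for (vol, blk), chunk in zip(addrs, chunks):
        volumes[vol][blk] = chunk              # file data traffic (the SAN in the article)


def client_read(meta, volumes, path):
    """Look up the block map, then fetch the anonymous blocks directly."""
    return b"".join(volumes[vol][blk] for vol, blk in meta.files[path])


# Two file data volumes holding nothing but numbered, anonymous blocks.
volumes = {"vol0": {}, "vol1": {}}
meta = MetadataStore(free=[("vol0", 0), ("vol1", 0), ("vol0", 1), ("vol1", 1)])
client_write(meta, volumes, "/var/www/index.html", [b"<html>", b"hi", b"</html>"])
print(client_read(meta, volumes, "/var/www/index.html"))
```

Note that nothing in `volumes` mentions a file name, and that one file's blocks end up on both volumes; only the block map on the metadata side ties them together.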
The central metadata server cluster may make this look like a traditional file server, like NFS. However, three things make it quite different:
1. The metadata servers never see the file data. File data never flows through their TCP/IP interfaces.
2. Storage Tank clients cache metadata, including their updates to it. They even cache inter-client locks. Unless multiple clients are actually contending over a file, no client-server communication takes place.
3. The individual metadata servers in the cluster are visible to the clients — clients talk directly to them.
All of this means central server work and LAN traffic and associated bottlenecks and latencies are greatly reduced compared to a file server. Note that some of that traffic just shifts to the SAN, whose fundamental purpose is disk I/O, and is probably built to take it.
IBM Total Storage SAN Filesystem
The first product to use Storage Tank technology became generally available from IBM on October 31, 2003. IBM Total Storage SAN Filesystem (TSANFS) consists of a metadata server cluster (hardware, software, and installation), client software for AIX 5.1 (32-bit) and Windows 2000, and installation service. IBM has announced that it will add client software for Linux in 2004.
TSANFS is supported for use with current IBM storage subsystems over fibre channel/FCP, using certain fibre channel hardware. But there’s no reason to believe it wouldn’t work with any fibre channel/FCP logical units or indeed with any shared storage device that looks like a SCSI device inside the client system. The code has been known to work with iSCSI SANs, for example.
The TSANFS metadata server cluster is a set of two to eight servers, each an IBM X series (IA32) machine running Linux and Storage Tank server software, which runs as a Linux application.
TSANFS doesn’t have Storage Tank’s transparent instantaneous failover in the metadata server cluster. If a metadata server fails, some of your data will be inaccessible for a few minutes. You may even have to reconfigure the cluster manually to distribute the workload to other metadata servers. And the same is true of recommissioning a repaired server and adding and removing servers from the cluster for any reason. The TSANFS metadata cluster does not automatically distribute load among its servers either — you statically assign certain parts of the file name space to certain servers.
Pricing of TSANFS is based on the number of servers in the cluster and the number of clients that use it. The entry list price, which includes two servers and 10 clients, is $90,000.
It’s All About The Caching
The only thing that makes Storage Tank technology more than a simple client/server engineering exercise is all of the caching that’s done by accessors of the filesystem (the Storage Tank clients). Imagine a SAN filesystem without caching: clients would access the disks in the SAN and talk to central metadata servers to satisfy application read and write requests. That would have all the sharing and storage management features we’ve been talking about and wouldn’t require any fancy technology, but it’d all be useless. Systems don’t want to wait to talk to either a disk on the SAN or a metadata server on the LAN. They don’t have to with locally attached disks, so they shouldn’t have to with shared disks either.
The problem with an individual client caching file data or metadata is that you want Client B to see changes that Client A makes to the file the moment after Client A makes them. If Client A buffers its changes locally, that can’t happen. It also can’t happen if Client B caches previously read file information.
But notice that in most cases, there will never be a conflict. Client B doesn’t read a file that Client A is updating. Most data is either used by lots of clients and updated rarely (think information warehouse), or is private to one client and rarely accessed by others (think log files). Remember that a shared filesystem is still important for unshared files such as log files, because of the storage administration angle. All the private log files on all the clients draw from a single pool of disk space and can be moved around by central control. Storage Tank capitalizes on the reality of few and rare conflicts by having clients cache locks.
Imagine that client www1.acme.com wants to read from file /var/www/index.html. www1.acme.com goes to the metadata server to get a read lock (here, the lock is a special Storage Tank lock, known as a data lock, not the kind of lock you get with a flock() or fcntl() system call) on /var/www/index.html, thus locking out other clients from updating the file and making it practical for the client to cache the file.
When www1.acme.com is done with /var/www/index.html, it doesn’t give up the lock; it keeps it. In fact, it keeps the lock until the server demands it back. The server won’t need to do that unless another client decides to update /var/www/index.html, which we’ve already said is rare. So the next time an application on www1.acme.com wants to read /var/www/index.html, www1.acme.com need not talk to either a metadata server or a disk. It satisfies the request right out of its file cache, with assurance that what’s in its cache is still what’s actually in the filesystem. And it works the same if 100 clients are all reading the same file: none of them talks to the metadata server except at the very beginning.
Now let’s imagine that client www1.acme.com wants to update /var/log/www1/httpd_access. It goes to the metadata server to get a write lock on /var/log/www1/httpd_access, thus preventing any other client from even looking at the file. www1.acme.com is then free to write its applications’ updates into its own memory cache. When it’s done writing, it keeps the write lock for future use.
www1.acme.com can continue processing updates without actually talking to any disks or servers until the server demands the lock back. The server won’t need to do that unless some other client wants to look at /var/log/www1/httpd_access, which we’ve already said is rare.
OK, but what about when the rare event happens and the server has to demand one of these locks back?
Let’s look at the read lock case first. Client admin.acme.com decides to update /var/www/index.html. It goes to the server to request a write lock on /var/www/index.html. The server can’t grant that lock immediately, because it knows that client www1.acme.com still has a read lock on /var/www/index.html. But any Storage Tank client is required to give up any lock on demand from a metadata server. The client can delay for a short period to get things in order, but cannot refuse or wait for some application to finish what it’s doing.
So, the server demands the lock back from www1.acme.com. Upon receiving the demand, www1.acme.com removes all trace of /var/www/index.html from its cache and then gives up the lock. Once www1.acme.com gives up its lock, the server gives admin.acme.com the write lock it requested, and admin.acme.com is free to start updating /var/www/index.html. Removing the file from the cache ensures that the next time an application on www1.acme.com wants to look at /var/www/index.html, www1.acme.com will read the file from disk, and thus see the updates that admin.acme.com made.
As for the write lock, let’s say stats.acme.com decides it wants to read /var/log/www1/httpd_access. It goes to the server for a read lock, but the server can’t grant it immediately because www1.acme.com has a write lock on the file. So the server demands the write lock back from www1.acme.com. Remember that www1.acme.com has buffered writes for this file in memory. So, in response to the lock demand, it flushes all that buffered write data to the disk, then purges the cache, and then gives up the lock. After that, the server can give stats.acme.com the read lock it asked for.
The flushing to disk ensures that stats.acme.com sees all the data that www1.acme.com’s applications wrote up until the moment stats.acme.com acquired the lock.
(To make the previous explanation easier to follow, we’ve assumed a client has a lock or it doesn’t. In practice, the server in this case doesn’t demand the whole lock back; it just tells www1.acme.com to downgrade its write lock to a read lock. That’s all the server needs to grant stats.acme.com the read lock it requested. That way, www1.acme.com still has to flush its buffered writes to disk, but since it’s keeping a read lock, it need not purge its cache.)
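The lock scenarios above can be sketched as a toy, single-process model. Everything here is an assumption made for illustration: the class and method names (`MetadataServer.acquire`, `Client.on_demand`, and so on) are hypothetical, a Python dict stands in for the shared disks, and real clients and servers would exchange these demands over the network rather than by direct method calls.

```python
class MetadataServer:
    """Tracks which client holds which data lock (hypothetical, in-memory)."""

    def __init__(self):
        self.holders = {}   # path -> {client: "read" or "write"}
        self.requests = 0   # lock requests received from clients

    def acquire(self, client, path, mode):
        self.requests += 1
        holders = self.holders.setdefault(path, {})
        for other, held in list(holders.items()):
            if other is client:
                continue
            if mode == "write":
                other.on_demand(path, keep=None)     # demand the whole lock back
                del holders[other]
            elif held == "write":
                other.on_demand(path, keep="read")   # downgrade write to read
                holders[other] = "read"
        holders[client] = mode


class Client:
    def __init__(self, server, disk):
        self.server, self.disk = server, disk
        self.locks = {}      # path -> mode, kept until the server demands it back
        self.cache = {}      # path -> cached file contents
        self.dirty = set()   # paths with buffered writes not yet on disk

    def _lock(self, path, mode):
        held = self.locks.get(path)
        if held == "write" or held == mode:
            return           # the cached lock is enough; no server conversation
        self.server.acquire(self, path, mode)
        self.locks[path] = mode

    def read(self, path):
        self._lock(path, "read")
        if path not in self.cache:
            self.cache[path] = self.disk[path]
        return self.cache[path]

    def write(self, path, data):
        self._lock(path, "write")
        self.cache[path] = data   # buffered in memory only
        self.dirty.add(path)

    def on_demand(self, path, keep):
        """The server wants the lock back (keep=None) or downgraded (keep="read")."""
        if path in self.dirty:
            self.disk[path] = self.cache[path]   # flush buffered writes first
            self.dirty.discard(path)
        if keep is None:
            self.cache.pop(path, None)           # purge so the next read hits disk
            self.locks.pop(path, None)
        else:
            self.locks[path] = keep              # cache stays valid under a read lock


disk = {"/var/www/index.html": "v1"}
server = MetadataServer()
www1, admin, stats = (Client(server, disk) for _ in range(3))
www1.write("/var/log/www1/httpd_access", "GET /")
print(stats.read("/var/log/www1/httpd_access"))  # forces www1 to flush and downgrade
```

In this model, repeated reads or writes by one client reach the server only once, while a conflicting request from another client triggers exactly the flush, purge, or downgrade described in the text.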
Caching locks (holding the lock until the server demands it back) ensures that clients rarely have to talk to the metadata server, as long as a file is either rarely updated or mostly used by just one client. And clients still get the full benefit of caching and perfect synchronization when file sharing does take place.
There are also special locks for use with direct I/O. In direct I/O, there is no caching of file data, so full read/write sharing without involving a metadata server is possible.
Now let’s look at the pathological case where a file is frequently updated and used by multiple clients. Following the locking scenarios above, it should be clear that in this case, multiple clients might have to converse with the metadata server every single time any client accesses the file, and clients will rarely have the file cached. Hence, Storage Tank is not a good fit for this workload.
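To put rough numbers on the difference between the common cases and the pathological one, here is a small counting sketch. It assumes a simplified model of the rules described above (a cached lock is reused, a write revokes everyone else, a read downgrades a writer); the function name `server_contacts` and the trace format are hypothetical.

```python
def server_contacts(trace):
    """Count accesses to one file that need a conversation with the metadata server.

    `trace` is a list of (client, op) pairs, with op being "read" or "write".
    """
    holders = {}   # client -> lock mode it currently has cached
    contacts = 0
    for client, op in trace:
        held = holders.get(client)
        if held == "write" or (held == "read" and op == "read"):
            continue                      # cached lock is sufficient
        contacts += 1
        if op == "write":
            holders = {client: "write"}   # every other client loses its lock
        else:
            holders = {c: "read" for c in holders}   # any writer is downgraded
            holders[client] = "read"
    return contacts


# 100 clients each reading the file 10 times: one contact per client.
many_readers = [(c, "read") for _ in range(10) for c in range(100)]
# One client appending to its private log 1,000 times: one contact in total.
private_log = [("www1", "write")] * 1000
# Two clients alternating writes and reads: every access needs the server.
ping_pong = [("www1", "write"), ("stats", "read")] * 50

print(server_contacts(many_readers), server_contacts(private_log), server_contacts(ping_pong))
```

Under this model, the two friendly workloads cost one server contact per client no matter how many accesses follow, while the alternating workload pays for every single access.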
As you read about the things Storage Tank is designed to do (cheap storage administration, simple data sharing), it probably occurs to you that sites have been doing similar things for years by not using a SAN. For example, you can pile all of your storage onto an NFS server in the center of the room, or you can put your disks under an LVM (logical volume manager), make one huge filesystem out of them, and then mount that filesystem on all your servers.
The reason to use a SAN setup instead is simply that there is no NFS server today that can handle the same volume of storage space and data access traffic as an entire SAN. While Storage Tank also has a central server, it is only a metadata server and preemptible lock manager, so it handles only a fraction of the traffic of a regular file server.
There is also the recently released Lustre, which is based on a TCP/IP object store (under a more liberal definition of SAN, it’s actually a SAN-based system). Lustre has a distributed structure and is designed for the same scale as Storage Tank.
Available and planned Lustre and Storage Tank implementations vary quite a bit in where they fit. If you have Windows servers or a fibre channel SAN, you could get Storage Tank today, but not Lustre. If you have Linux or no fibre channel, you can get Lustre but not Storage Tank.
Looking forward, the set of platforms for Storage Tank implementations will probably be a superset of those for Lustre, at least for a while. Though there’s no hard data on it, based on its structure, Lustre may cost less, scale better, and have greater ultimate capacity than Storage Tank.
Central Storage Administration
Since the primary purpose of merging a whole SAN into a single filesystem is to make storage administration easy, Storage Tank includes a variety of modern storage administration tools that are integrated directly with the metadata servers. Storage Tank tools can:
* AUTOMATE POLICY-BASED MANAGEMENT. You set policies that determine where on the SAN data goes based on filename, owner uid, and other parameters, and the metadata servers realize those rules.
* ADD AND REMOVE STORAGE ON THE FLY. You can add and remove file data and metadata volumes while the system runs, to the extent that your SAN allows. You give a command to Storage Tank to make it start using a new logical unit (disk) or stop using an old one.
* MOVE DATA. You can move data from one group of file data volumes to another while your application servers are actively reading and writing the same blocks.
* MAINTAIN SNAPSHOT COPIES. You can make read-only snapshot copies of a subtree of files. You can also keep multiple copies of each subtree. A snapshot copy shows up as a complete copy of the subtree under a special snapshot directory in the filesystem.
* ESTABLISH QUOTAS. You can set a quota on a subtree of files so that a runaway application can’t use up all the free space in the system.
A drawback of the Storage Tank snapshot and quota facilities is that the subtrees mentioned (commonly called file sets) are the same for both. And a single metadata server (at a time) must serve all the files in a subtree, so that limits the practical size of one. So, these features are not as flexible as you might imagine.
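As an illustration of the policy-based placement item in the list above, here is a minimal sketch of first-match rule evaluation. The rule format, the pool names, and the function `choose_pool` are all hypothetical; the article only says that policies can key on filename, owner uid, and other parameters, not how Storage Tank expresses them.

```python
from fnmatch import fnmatchcase

# Hypothetical rule format: (filename glob, owner uid or None for "any", target pool).
RULES = [
    ("*.log", None, "cheap_pool"),   # log files from any owner
    ("*", 1001, "fast_pool"),        # everything else owned by uid 1001
]
DEFAULT_POOL = "general_pool"


def choose_pool(filename, owner_uid, rules=RULES):
    """Return the storage pool named by the first rule that matches the new file."""
    for pattern, uid, pool in rules:
        if fnmatchcase(filename, pattern) and uid in (None, owner_uid):
            return pool
    return DEFAULT_POOL


print(choose_pool("httpd_access.log", 0))
print(choose_pool("orders.db", 1001))
print(choose_pool("notes.txt", 42))
```

First-match ordering is a design choice of this sketch: it keeps the rules easy to reason about, at the cost of making their order significant.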
Storage Tank isn’t the only technology around for realizing the SAN dream. IBM has been working on Storage Tank for over six years, and in that time others have noticed this gaping hole in SAN technology and risen to meet the challenge as well.
SGI’s CXFS is like Storage Tank, except without the multinode metadata server cluster. It works with Linux, AIX, Solaris, Irix, and Windows clients. ADIC’s StorNext File System product may be the oldest Storage Tank-like system, organizing SANs since 1998. IBM itself has been shipping GPFS.
Like Storage Tank, GPFS turns a SAN into a single shared filesystem. Additionally, GPFS takes a rather different, more distributed approach (there’s no standalone metadata service) that gives better scalability. However, the distributed approach also makes storage management harder to implement, and in fact, existing implementations of GPFS don’t do much of it. Furthermore, IBM’s GPFS product today, which was developed specifically for cluster computing, is limited to either AIX or Linux, but not both at the same time. IBM’s direction in storage indicates that future product development will be based on Storage Tank rather than GPFS.
Sistina’s GFS and its open source cousin, openGFS, are very similar to GPFS. These products allow only Linux clients.
While Storage Tank is designed for very large computing complexes, the existence of a central metadata service would give any system designer chills because it represents a single point of failure (SPoF). If the metadata service fails, the entire complex fails. Because of this, Storage Tank comes with fault tolerance.
Storage Tank metadata servers form a cluster. Each server can do the job of any other server in the cluster, so if one falls, another one can pick up its work.
Storage Tank clients carefully maintain a copy of any volatile information in a server so that if a workload transfers, the new server and the existing clients can together reconstruct the exact state of filesystem access at the moment of failure. The loss of a server is transparent.
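The idea that clients hold enough information to rebuild a failed server's volatile state can be sketched as follows. This is a deliberately tiny model with hypothetical names (`grant`, `report_locks`, `recover_from`); it covers only the lock table, whereas the real protocol would have to reconstruct more state than this.

```python
class Client:
    def __init__(self, name):
        self.name = name
        self.locks = {}   # path -> mode; the client's own copy of what it was granted

    def report_locks(self):
        return dict(self.locks)


class MetadataServer:
    def __init__(self):
        self.lock_table = {}   # path -> {client name: mode}; volatile, lost in a crash

    def grant(self, client, path, mode):
        self.lock_table.setdefault(path, {})[client.name] = mode
        client.locks[path] = mode   # the client remembers the grant too

    def recover_from(self, clients):
        """Rebuild the lock table of a failed peer from what the clients report."""
        for client in clients:
            for path, mode in client.report_locks().items():
                self.lock_table.setdefault(path, {})[client.name] = mode


www1, stats = Client("www1"), Client("stats")
failed = MetadataServer()
failed.grant(www1, "/var/www/index.html", "read")
failed.grant(stats, "/var/www/index.html", "read")
failed.grant(www1, "/var/log/www1/httpd_access", "write")

replacement = MetadataServer()          # takes over after `failed` goes down
replacement.recover_from([www1, stats])
print(replacement.lock_table == failed.lock_table)
```

Because each grant is recorded on both sides, the replacement server can recover the table without any help from the failed one.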
Maintaining a little diversity among servers (for example, applying a software update to each server, one at a time) helps ensure that all servers don’t fail at once. The same principles apply to scheduled maintenance. You can maintain the server cluster one server at a time and never have to shut down any services.
What’s Not to Like?
Given all of the features that Storage Tank provides, you may be wondering, “So, what gives?” Here’s a brief list of what’s not in Storage Tank:
* REMOTE ACCESS. You have to be within SAN distance of the storage.
* FREQUENT READ/WRITE SHARING OF FILES.
* OPEN NETWORKS. Every computer that can connect to either the SAN storage devices or a metadata server must be trusted to follow Storage Tank rules in accessing the filesystem.
* RAID. Storage Tank expects this to be within the SAN.
* MULTIPATHING. Storage Tank expects this to be within the SAN.
Openness of the Technology
IBM has published the specifics of its Storage Tank protocol, the network protocol (transported over TCP or UDP) for communication between Storage Tank clients and Storage Tank metadata servers. IBM has also published, and licensed under the GNU General Public License (GPL), a reference implementation of a Storage Tank client for Linux. This reference implementation is complete, but needs additional work and testing to meet normal, large system performance and reliability requirements.
IBM intends to release the production-grade SANFS Linux client under the GPL as well. IBM encourages people to develop Storage Tank client software (which consists primarily of a Storage Tank filesystem driver) for lots of platforms, and encourages developers of similar metadata servers to use the Storage Tank protocol.
IBM hopes to make money from Storage Tank by selling metadata server hardware and software. IBM also hopes Storage Tank will increase the value of SAN storage so that IBM can sell more of that.
Bryan Henderson is a software engineer on IBM’s Storage Tank project. He wrote much of the code that will appear in the SANFS Linux client. Henderson has worked in large system storage since 1986, and on Storage Tank since 2001. You can reach Bryan at firstname.lastname@example.org.