One of the most interesting features in Solaris is its ZFS filesystem. Read on for a quick guide to ZFS administration.
ZFS is an open source file system that was developed at Sun Microsystems and runs on OpenSolaris and Solaris 10 releases. ZFS has also been ported to FreeBSD, Mac OS X, and Linux (via FUSE). In this article, we’ll look at some basic tasks that you can do with ZFS, and how to accomplish them.
ZFS is About Easy Administration
ZFS is a general purpose file system that is built on top of a storage pool. You can use ZFS to reduce the time it takes to manage your storage and your data because ZFS cuts out the middle man: it doesn’t use a virtualized volume layer like other volume management products, so you can aggregate your storage under ZFS with a few simple commands.
ZFS makes it easy to create and manage file systems without needing multiple commands or editing configuration files. You can easily set quotas or reservations, turn compression on or off, manage mount points for numerous file systems, and so on, all with a single command.
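For example, a single zfs set command per property is all it takes. The following sketch assumes a file system named users/userA exists (one is created later in this article); the property values and mount point are illustrative:
# zfs set quota=10G users/userA
# zfs set compression=on users/userA
# zfs set mountpoint=/export/userA users/userA
# zfs get quota,compression,mountpoint users/userA
Each property takes effect immediately, without editing any configuration files.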
ZFS provides unlimited constant-time snapshots. A snapshot is a read-only, point-in-time copy of a file system. Any snapshot can generate a full backup, and any pair of snapshots can generate an incremental backup.
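For example, you could use zfs send to turn snapshots into backup streams. This sketch assumes snapshots named @yesterday and @today already exist on users/userA, and the backup paths are illustrative:
# zfs send users/userA@today > /backup/userA-today.zfs
# zfs send -i users/userA@yesterday users/userA@today > /backup/userA-incr.zfs
The first command produces a full backup from a single snapshot; the second produces an incremental stream containing only the changes between the two snapshots.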
No arbitrary limits exist in ZFS. You can have as many files as you want, including unlimited links, directory entries, snapshots, and so on.
Setting Up Your ZFS Storage Pools and File Systems
ZFS aggregates devices into a storage pool instead of using a volume management layer that virtualizes volumes. A ZFS storage pool describes the physical characteristics of the storage, such as the device layout and data redundancy.
File systems created from a storage pool are allowed to share space with all file systems in the pool. You don’t have to predetermine the size of a file system because file systems grow automatically within the space allocated to the storage pool. When new storage is added, all file systems within the pool can immediately use the additional space.
For redundancy, ZFS supports mirrored, RAID-Z (single parity), and RAID-Z2 (double parity) configurations. RAID-Z is similar to RAID-5, but uses variable stripe width to eliminate the RAID-5 write hole (stripe corruption due to loss of power between data and parity updates). All RAID-Z writes are full-stripe writes. There’s no read-modify-write tax, no write hole, and, best of all, no need for NVRAM in hardware.
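Each redundancy level is requested with a keyword on the zpool create command line. As a quick sketch (the pool name tank and the device names are illustrative, and only one of these commands would be run for a given pool):
# zpool create tank mirror c0t0d0 c0t1d0
# zpool create tank raidz c0t0d0 c0t1d0 c0t2d0
# zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0
The mirror, raidz, and raidz2 keywords select two-way mirroring, single-parity RAID-Z, and double-parity RAID-Z2, respectively.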
Check here for recommendations about the best ZFS redundancy configuration for your storage needs, including strategies for using ZFS with storage devices that also provide their own redundancy. One benefit of using ZFS redundancy is that ZFS can not only report but also repair potential data inconsistencies.
However, if you choose not to implement ZFS redundancy, then ZFS can only report those data inconsistencies.
Create a Simple ZFS Storage Pool
The following example shows how to create a simple, non-redundant (striped) storage pool called users with two 465GB disks.
# zpool create users c1t0d0 c1t1d0
# zpool status users
pool: users
state: ONLINE
scrub: none requested
config:
NAME        STATE     READ WRITE CKSUM
users       ONLINE       0     0     0
  c1t0d0    ONLINE       0     0     0
  c1t1d0    ONLINE       0     0     0
errors: No known data errors
The approximate space in this pool is 928 Gbytes.
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
users 928G 112K 928G 0% ONLINE -
A ZFS file system called users is automatically created and mounted at /users.
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
users 108K 913G 18K /users
A striped configuration doesn’t provide the redundancy that ZFS needs to repair data inconsistencies. We can easily convert this pool into a mirrored ZFS configuration to provide device redundancy protection.
For more information, see Convert a Simple ZFS Storage Pool to a Mirrored Storage Pool later in this article.
Create ZFS File Systems
After the pool is created, you can create a hierarchy of ZFS file systems to match your environment.
Each file system has access to all the usable space in the pool.
# zfs create users/userA
# zfs create users/userB
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
users 160K 913G 19K /users
users/userA 18K 913G 18K /users/userA
users/userB 18K 913G 18K /users/userB
If userA needs more disk space, then set a reservation so that userA has this space available.
# zfs set reservation=30GB users/userA
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
users 30.0G 883G 21K /users
users/userA 18K 913G 18K /users/userA
users/userB 18K 883G 18K /users/userB
Next, create snapshots for each user file system in the pool.
# zfs snapshot -r users@today
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
users 30.0G 883G 21K /users
users@today 0 - 21K -
users/userA 18K 913G 18K /users/userA
users/userA@today 0 - 18K -
users/userB 18K 883G 18K /users/userB
users/userB@today 0 - 18K -
Convert a Simple ZFS Storage Pool to a Mirrored Storage Pool
You can convert a non-redundant pool into a mirrored storage pool configuration by attaching disks of the same size (or larger).
# zpool attach users c1t0d0 c2t0d0
# zpool attach users c1t1d0 c2t1d0
# zpool status
pool: users
state: ONLINE
scrub: resilver completed with 0 errors on Thu Feb 14 14:02:58 2008
config:
NAME          STATE     READ WRITE CKSUM
users         ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c2t0d0    ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c1t1d0    ONLINE       0     0     0
    c2t1d0    ONLINE       0     0     0
errors: No known data errors
The above syntax converts the simple striped configuration into a configuration of two two-way mirrors by using the zpool attach command. Specify one of the existing disks for each attach operation. The amount of disk space remains the same, but the data protection is greatly increased.
In a ZFS mirrored configuration, you can use the zpool attach and zpool detach commands to expand the size of a pool by swapping in larger disks, as sketched below. For more information, see the ZFS Administration Guide.
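For example, assuming the mirrored users pool above, you might grow one mirror by attaching a larger disk and, once resilvering completes, detaching a smaller one (c5t0d0 is a hypothetical larger disk):
# zpool attach users c1t0d0 c5t0d0
(wait for zpool status to show that the resilver has completed)
# zpool detach users c1t0d0
Repeat the process for each disk in the mirror. The extra capacity becomes available once every device in the mirror has been upgraded; on the releases discussed here, you may also need to export and import the pool before the new space appears.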
Replace Devices in a ZFS Storage Pool
You can identify faulted devices with the zpool status command. The output below identifies that c2t1d0 is unavailable.
# zpool status
pool: users
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-2Q
scrub: resilver completed with 0 errors on Thu Feb 14 14:17:14 2008
config:
NAME          STATE     READ WRITE CKSUM
users         DEGRADED     0     0     0
  mirror      ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c2t0d0    ONLINE       0     0     0
  mirror      DEGRADED     0     0     0
    c1t1d0    ONLINE       0     0     0
    c2t1d0    UNAVAIL      0     0     0  cannot open
errors: No known data errors
Because this is a mirrored configuration with sufficient replication, all data is online and available. After the faulted device is physically replaced, notify ZFS of the replacement with the zpool replace command. The basic steps are:
- Shut down the system (if devices are not hot-pluggable)
- Replace the faulty disk
- Bring the system back up (if devices are not hot-pluggable)
- Notify ZFS of the disk replacement. (For example, if c2t1d0 was replaced with an identical device in the same location, run zpool replace users c2t1d0; a fuller sketch follows this list.)
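If your devices are hot-pluggable, a minimal sequence might look like the following (the device name is the one from the example above; adjust it to match your configuration):
# zpool offline users c2t1d0
(physically replace the disk)
# zpool replace users c2t1d0
# zpool status users
The zpool status output lets you watch the resilver progress onto the new disk.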
In addition to faulty device replacement, you can use the zpool replace command to expand your existing storage pool with larger disks. For example, if your storage pool consists of three 100GB disks (c0t1d0, c1t1d0, c2t1d0), you could replace them with three 300GB disks (c3t1d0, c4t1d0, c5t1d0) by using syntax similar to the following:
# zpool replace pool-name c0t1d0 c3t1d0
# zpool replace pool-name c1t1d0 c4t1d0
# zpool replace pool-name c2t1d0 c5t1d0
Currently, you will need to export and import the pool to see the expanded disk space. For more information, see the ZFS Administration Guide. Keep in mind that the amount of time it takes to copy or resilver data from the original disk to the replacement disk depends on the amount of data on the original disk.
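That final export/import step, using the pool-name placeholder from the commands above, would look something like this:
# zpool export pool-name
# zpool import pool-name
# zpool list pool-name
After the import, zpool list should report the larger pool size.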
Reduce Hardware Down Time by Using Hot Spares
You can also add spare devices to your storage pool to reduce the downtime caused by hardware failures, like so:
# zpool add users spare c3t1d0 c4t1d0
# zpool status
pool: users
state: ONLINE
scrub: resilver completed with 0 errors on Thu Feb 14 14:19:01 2008
config:
NAME          STATE     READ WRITE CKSUM
users         ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c2t0d0    ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c1t1d0    ONLINE       0     0     0
    c2t1d0    ONLINE       0     0     0
spares
  c3t1d0      AVAIL
  c4t1d0      AVAIL
If a device in a storage pool fails, the spare automatically kicks in.
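For example, if c2t1d0 were to fail and the spare c3t1d0 took over for it, you could later physically replace the failed disk and return the spare to the available list with commands like these:
# zpool replace users c2t1d0
# zpool detach users c3t1d0
The zpool detach step is only needed if the spare is not detached automatically once the resilver onto the replacement disk completes.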
Tips on Managing ZFS Storage Pools and File Systems
To close, let’s take a quick look at a few tips for managing ZFS storage pools and file systems.
- Set up a zpool scrub schedule to search for data inconsistencies (see the sketch after this list). For more information, see the best practices guide here.
- Review the history of commands run against your pools and file systems with the zpool history command.
- Always notify ZFS when you are going to remove devices from a pool. You cannot detach or offline disks in a non-redundant ZFS configuration. If you remove a disk from the system without notifying ZFS, the pool might become unavailable. For considerations about using ZFS in your environment, see the FAQ.
- You can set the redundancy level for file system data by using the ZFS copies property (also shown after this list), but setting this property does not protect against all disk failures. For more information about using the copies property, check out the blogs.
- For a list of ZFS product reviews and topics, such as using ZFS with iSCSI volumes, databases, and an automatic snapshot service, see this site.
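As a quick sketch of the first and fourth tips (the schedule, pool name, and file system name are illustrative): you can scrub a pool on demand,
# zpool scrub users
or schedule it weekly from root’s crontab, for example every Sunday at 2 a.m.:
0 2 * * 0 /usr/sbin/zpool scrub users
The copies property is set like any other ZFS property; this example stores two copies of every block written to users/userA:
# zfs set copies=2 users/userA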
There is, of course, much more to ZFS, but this article should get you started in administering ZFS.
Cindy Swearingen is a longtime system administration advocate and is currently employed by Sun Microsystems in Broomfield, Colorado.
Comments on "Solving Common Administration Problems with ZFS"
Somewhat lame and shallow article on such an awesome product. The info here is very outdated.
I concur with dutler. Perhaps it was written with the “business” person in mind.
Do any of you guys know of good articles which would offer a better overview?
Regardless of what the first two critics said, this is a good overview of common tasks that can be done with zfs. It serves its purpose of giving folks who aren’t familiar with zfs a taste of what’s possible.
It’s not an in-depth tutorial of every aspect of how to use zfs and why — but it never claimed to be. “dutler” and “dvnguyen” need to pull their heads out.
Oh, dear – where to begin? The ZFS hype has started to assume the dimensions of open-source religion, so it may not even be worth commenting.
But just in case:
1. While not saying so out-right, the article still apparently seeks to perpetuate the myth that ZFS *eliminated* a full administrative layer (volume management), whereas in fact it simply redefined it as ‘pool management’. Yes, by virtue of being tightly-integrated with the file system it’s somewhat more flexible and easy to manage than traditional volume management, but it still leaves a lot to be desired (e.g., needing a separate pool with specific disk match-ups for each flavor of redundancy, and preferably for the system root pool as well: if that doesn’t remind you of ‘volume management’, what would?). Look to the Drobo product for a primitive example of what’s possible here.
2. It’s not at all excessive to characterize ZFS’s “RAID-Z” implementation as being brain-damaged. Yes, its full-stripe-write-only approach eliminates the dreaded ‘write hole’, but so do most conventional RAID-5 approaches – and ZFS *could* have done so wholly in software by using the transaction log which it includes anyway. Furthermore, the price that it pays for avoiding the “read-modify-write tax” (again, something the impact of which it could have significantly alleviated by other means) is to involve all the disks in the stripe on every write (a dubious trade-off all by itself) *and all but one of them on every read as well* (which can reduce throughput by a factor of N-1 for small random-read workloads).
3. Despite the fact that modern Unix filesystems (recognizing the value of on-disk contiguity for bulk access) have mostly migrated toward extent-based storage (even ext4 reportedly will, last I heard), ZFS clings to single-block mapping and allocation and limits the maximum block size to 128 KB. This can lead to severe data fragmentation, especially in contentious multi-threaded server applications and even more especially for files that are randomly updated – and there’s no defragmenter.
4. In an article about file system administration you’d normally see at least some mention of quota management, but that’s a bit awkward for ZFS because it supports neither user nor group quotas – just quotas on entire filesystems. So if you want to establish something resembling user quotas, you need to define and mount a different filesystem for each user and then confine his/her activities to that sub-tree (since any files created elsewhere will be charged to someone else). And the closest you can get to a group quota is to define yet another filesystem and have the group work within that sub-tree.
And finally, of course, there’s the issue that ZFS doesn’t appear to be portable to Linux – at least as a conventional kernel filesystem – due to license incompatibilities. So what’s a puff piece by a Sun employee about how to administer it doing in Linux Magazine?
ZFS does have some interesting aspects (though exactly how ‘novel’ most of them are is subject to debate), but the hype has gotten tiresome enough that I sometimes take the time to present at least a brief alternate viewpoint – just to clarify that the claim that ZFS constitutes “The Last Word In File Systems” (the title of Sun’s early ZFS presentations) is decidedly open to question.
- bill
Man, tough crowd! Thanks for the article–it was interesting and informative.
I’ve never met a “business person” who gave a crap about filesystems in *nix operating systems. See the abstract under the article’s title “…Read on for a quick guide to ZFS administration.”
As promised, this was a quick guide. What else did you expect??? For those looking for more depth, I bet they know how to use Google.
Could dutler or dvnguyen point us to their blog or website so we can see their in-depth write-ups on ZFS?
1. … but it still leaves a lot to be desired (e.g., needing a separate pool with specific disk match-ups for each flavor of redundancy, and preferably for the system root pool as well: if that doesn’t remind you of ‘volume management’, what would?). Look to the Drobo product for a primitive example of what’s possible here.
You most definitely do not require a separate pool for each redundancy type. A “pool” is comprised of multiple “vdevs”. How you create those vdevs is up to you. Data in the pool will be striped across all the vdevs in the pool. Hence, you can create any kind of redundancy setup you want:
zpool create pool mirror da0 da1
zpool add pool raidz1 da2 da3 da4
zpool add pool raidz2 da5 da6 da7 da8
Voila! A single pool comprised of three vdevs. Data will be striped across all three vdevs, thus creating a hybrid “RAID 1+5+6/0”.
That’s the whole point of pooled storage. You just keep adding vdevs as needed to expand the total size of the pool. And you just replace drives within vdevs to expand the vdevs (also expanding the size of the pool).