In the old days, disk space cost a pretty penny, so saving space was essential. But now that disk space costs about $0.50 per gigabyte, a lot of folks never worry about deleting files, let alone compressing them. However, if you're administering a large, shared server (such as for email), it seems that you can never have too much space.
Years ago it was common to compress files using the compress command that was standard on most versions of Unix. It did a reasonable job of shrinking most text files without using a ton of CPU power (because CPUs weren’t terribly fast either). Nowadays, with big disks, lots of memory, and very fast CPUs, you have many more options when you do decide to compress your files.
Three freely available and relatively common compression tools are gzip, bzip2, and rzip. Let’s compare them, using a standard Unix mailbox file (in mbox format) containing a few months’ worth of electronic mail as a benchmark. (Obviously, you should perform your own testing and use a sample of the data you’re trying to compress, but email is common enough that it should provide an adequate measure of the three tools’ capabilities.)
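If you want to run a similar comparison on a sample of your own data, a few lines of script will do. The sketch below is only an approximation of the test described here: it uses Python's standard gzip and bz2 modules rather than the command-line tools, it leaves out rzip (which has no standard-library binding), and the function name benchmark is an illustrative choice, not something from the original test.

```python
import bz2
import gzip
import time


def benchmark(data: bytes) -> dict:
    """Compress data with gzip and bz2; report size and CPU seconds for each."""
    # Levels mirror the command-line defaults: gzip -6 and bzip2 -9.
    compressors = {
        "gzip -6": lambda d: gzip.compress(d, compresslevel=6),
        "bzip2 -9": lambda d: bz2.compress(d, compresslevel=9),
    }
    results = {}
    for name, compress in compressors.items():
        start = time.process_time()  # CPU time, not wall-clock time
        packed = compress(data)
        cpu = time.process_time() - start
        results[name] = {"bytes": len(packed), "cpu_seconds": cpu}
    return results
```

To try it, read a sample mailbox into memory with open(path, "rb").read() and pass the bytes to benchmark(). The numbers will differ from those produced by the real tools, but the relative sizes and CPU times should tell a similar story.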
This Just In…
The results of compressing a 180 MB mailbox on a 2.4 GHz machine with 2 GB of RAM and a 2.4.23 kernel are shown in Table One.
Table One: Compressing a 180 MB mbox-format file
Today, you can find gzip nearly everywhere. gzip has been around for years and has all but replaced the old compress utility on most Unix-like systems.
gzip is known for its speed, and the table bears that out. Like most command-line tools, gzip can read from and write to a pipe, so the data you’re compressing need not actually reside in a file on disk.
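The same streaming idea can be sketched in a script. The example below assumes Python's standard gzip module in place of the gzip command, and the function name gzip_stream is hypothetical. It compresses whatever arrives on an input stream and writes the result to an output stream, so nothing has to exist as a file on disk.

```python
import gzip
import shutil


def gzip_stream(src, dst, level: int = 6) -> None:
    """Copy bytes from the src stream to dst, gzip-compressing along the way."""
    # GzipFile wraps the destination stream; closing it flushes the gzip
    # trailer but leaves dst itself open for the caller.
    with gzip.GzipFile(fileobj=dst, mode="wb", compresslevel=level) as gz:
        shutil.copyfileobj(src, gz)  # copies in fixed-size chunks
```

Called as gzip_stream(sys.stdin.buffer, sys.stdout.buffer), this behaves roughly like placing gzip in the middle of a pipeline.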
Interestingly, for an mbox file, the difference in compressed file sizes was less than a megabyte between gzip’s default mode (-6) and its “try hardest” (-9) mode. However, results can vary based on the data you’re compressing.
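To see how much the extra effort buys on your own data, you can measure the gap directly. This is a minimal sketch that assumes Python's gzip module behaves closely enough to the gzip command at the same levels. The function name level_gap is made up for illustration.

```python
import gzip


def level_gap(data: bytes) -> int:
    """Return how many bytes level 9 saves compared with level 6."""
    default_size = len(gzip.compress(data, compresslevel=6))
    hardest_size = len(gzip.compress(data, compresslevel=9))
    # Usually zero or positive; a small result means -9 is not worth the time.
    return default_size - hardest_size
```

A result that is small relative to the size of the input suggests the default level is the better bargain for that kind of data.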
Where gzip prefers speed over efficiency, rzip attempts to produce the smallest possible files.
In the mbox test, rzip compressed the file nearly twice as well as anything else. In fact, this test really doesn’t do rzip justice. On a 1 GB mbox file containing mostly spam, rzip was able to achieve a compression ratio of almost 8:1!
However, there’s a price: rzip took about 219 MB of RAM to do the job. Moreover, rzip isn’t installed by default on most Linux distributions, and it can only compress files that already live on disk. It cannot read from a pipe.
rzip was the Ph.D. thesis work of Andrew Tridgell (of Samba fame), and actually uses the core compression library of bzip2 under the hood.
While not quite as popular as gzip, bzip2 is nearly as ubiquitous. As the table shows, bzip2 uses a lot more CPU time, but provides better compression.
In general, bzip2 produces smaller files, trading CPU time for disk space. But the trade is a lopsided one, given the almost order-of-magnitude difference in CPU utilization. (By the way, with bzip2, -9 is the default.)
And the Winner Is…
If you’re trying to get the smallest possible file without sacrificing portability, bzip2 is likely the best choice. But for pure performance, stick with gzip.
While rzip isn’t as well known as Samba, expect to see more and more of it as CPUs continue to get faster and memory comes down in price.
Do you have an idea for a “Do It Yourself” project? Send your suggestions to firstname.lastname@example.org.