Ceph: The Distributed File System Creature from the Object Lagoon

Did you ever see one of those terrible Sci-Fi movies involving a killer Octopus? Ceph, while named after just such an animal, is not a creature about to eat an unlucky Spring Breaker, but a new parallel distributed file system. The client portion of Ceph just went into the 2.6.34 kernel so let's learn a bit more about it.

The last two years have seen a large number of file systems added to the kernel, with many of them maturing to the point where they are useful, reliable, and in some cases in production. In the run-up to the 2.6.34 kernel, Linus recently added the Ceph client. What makes Ceph unique is that it is a distributed parallel file system promising scalability and performance, something that NFS lacks.

High-level view of Ceph

One might ask about the origin of the name Ceph since it is somewhat unusual. Ceph is really short for Cephalopod, which is the class of molluscs to which the octopus belongs. So it’s really short for octopus, sort of. If you want more detail, take a look at the Wikipedia article about cephalopods. Now that the name has been partially explained, let’s look at the file system.

Ceph was started by Sage Weil for his PhD dissertation at the University of California, Santa Cruz in the Storage Systems Research Center in the Jack Baskin School of Engineering. The lab is funded by the DOE/NNSA involving LLNL (Lawrence Livermore National Labs), LANL (Los Alamos National Labs), and Sandia National Laboratories. He graduated in the fall of 2007 and has kept developing Ceph. As mentioned previously, his efforts have been rewarded with the integration of the Ceph client into the upcoming 2.6.34 kernel.

The design goals of Ceph are to create a POSIX file system (or close to POSIX) that is scalable, reliable, and has very good performance. To reach these goals Ceph has the following major features:

  • It is object-based
  • It decouples metadata and data (many parallel file systems do this as well)
  • It uses a dynamic distributed metadata approach

These three features and how they are implemented are at the core of Ceph (more on that in the next section).

However, probably the most fundamental assumption in the design of Ceph is that large-scale storage systems are dynamic and that failures are guaranteed. The first part of the assumption, that storage systems are dynamic, means that storage hardware is added and removed and that the workloads on the system are changing. The second part presumes there will be hardware failures, so the file system needs to be adaptable and resilient.

More in Depth

With the general view of Ceph in mind, let’s dive down into some more details to understand how it’s implemented and what it means. Below in Figure 1 is an overview of the layout of Ceph.

Figure 1: System layout of Ceph.

There are client nodes (the happy smiling faces), a metadata cluster, and the object storage cluster where the data is stored. When a client wants to open a file, it contacts the metadata cluster, referred to as the MDS, or MetaData Server, which is in fact a cluster of servers. The MDS returns information to the client telling it what its capabilities are (what it can and cannot do), the file size, striping information (the data is striped across multiple storage devices for performance reasons), and the file inode (used by Ceph). Once this information is received, the client sends and receives data directly to and from the Object Storage Devices (OSD’s) which make up the data storage cluster. During the data transactions the MDS is checked to see if there are any changes. If there are none, everything proceeds normally. If there are changes, the MDS notifies the client and the OSD’s. Once everything is done and the close request is sent to the MDS and OSD’s, the client updates the MDS with any details, and the MDS marks the file as closed and updates the metadata information.
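The flow above can be sketched in a few lines of Python. This is a toy model, not Ceph’s actual API: all class names and fields here are illustrative, and each OSD is modeled as a simple dictionary mapping an inode to a data chunk.

```python
# A highly simplified sketch of the open/read path described above.
# All class names and fields are illustrative, not Ceph's actual API.

class MDSCluster:
    """Stands in for the metadata cluster (MDS)."""
    def __init__(self, files):
        self.files = files  # path -> metadata dict

    def open(self, path):
        meta = self.files[path]
        # The MDS hands back capabilities, file size, and layout info;
        # it is not involved in moving the data itself.
        return {
            "capabilities": {"read", "write"},
            "size": meta["size"],
            "inode": meta["inode"],
        }

class Client:
    def __init__(self, mds, osds):
        self.mds = mds
        self.osds = osds  # each OSD modeled as a dict: inode -> chunk

    def read(self, path):
        info = self.mds.open(path)
        # With the layout in hand, talk to the OSD's directly.
        chunks = [osd.get(info["inode"]) for osd in self.osds]
        return b"".join(c for c in chunks if c is not None)
```

The key design point the sketch captures is that the MDS is only on the control path: once `open()` returns, data moves between the client and the OSD’s without the metadata cluster in the middle.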

Object-Based Storage
The system layout serves as a guide for further discussing the details and features of Ceph. One of the first features that needs to be explained is the object-based approach of the file system. In an object-based file system, the data is broken into objects, each of which is assigned an object ID number and a small amount of metadata, and then sent for storage on the Object Storage Devices (OSD’s). The file system metadata for the file then consists of the object ID’s that define all of the data, as well as other information about the file (e.g. access/modify dates, etc.). Typically, the metadata does not know precisely where the data is located and relies on the OSD’s for the storage and retrieval of the actual data. The OSD takes care of the lower-level functions itself (a kind of “smart” hard drive if you will). The file system interacts with the OSD’s at a high level, requesting the object itself or information about the object, rather than asking for a range of inodes or blocks or something similar.
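To make this concrete, here is a minimal sketch of object-based storage. The object size, the ID scheme, and the dictionary standing in for the OSD layer are all assumptions for illustration; the point is that the file’s metadata holds only an ordered list of object ID’s, not block addresses.

```python
import hashlib

OBJECT_SIZE = 4  # bytes per object; tiny for illustration (real systems use megabytes)

def store_file(data, osd_store):
    """Break file data into objects, give each an object ID, and keep only
    the ordered ID list (plus attributes) as the file's metadata.
    `osd_store` stands in for the OSD layer: a dict of object ID -> bytes.
    """
    object_ids = []
    for off in range(0, len(data), OBJECT_SIZE):
        chunk = data[off:off + OBJECT_SIZE]
        # Derive an ID from offset and content; any unique ID scheme works.
        oid = hashlib.sha1(off.to_bytes(8, "big") + chunk).hexdigest()
        osd_store[oid] = chunk  # the OSD handles actual placement itself
        object_ids.append(oid)
    return {"objects": object_ids, "size": len(data)}

def read_file(metadata, osd_store):
    """Reassemble the file by asking the OSD layer for each object."""
    return b"".join(osd_store[oid] for oid in metadata["objects"])
```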

While true OSD drives have only ever been experimental, the typical way of creating an OSD is to use a middle layer of software between the object-based file system and the file system on the drive itself. In this approach the drive is just a regular hard drive such as those we currently use, and the OSD middle layer converts the object request into a file system request on the underlying drive.

Initially Ceph used something called EBOFS (Extent and B-tree based Object File System), but support was dropped in mid-2009. It was replaced with btrfs, which promises to give as good as or better performance than EBOFS. In addition, btrfs has a few features that EBOFS does not. Namely,

  • Copy-on-write semantics for file data (who doesn’t like a COW?)
  • Well maintained and tested (it’s in the kernel and under heavy development)

In addition, according to the Ceph wiki,

"... To avoid reinventing the wheel, Ceph will use btrfs on individual storage nodes (OSDs) to store object data, and we will focus on adding any additional functionality needed to btrfs where it will hopefully benefit non-Ceph users as well. ..."

For example, there is a recent patch that adds some features to btrfs that help Ceph.

Distributed Metadata
Another key aspect of Ceph that distinguishes it from other file systems is that it uses something Sage terms “Dynamic Distributed Metadata Management.” The first keyword is Distributed, meaning multiple metadata servers, unlike Lustre which has only one metadata server. Being distributed means that the loss of a metadata server (MDS) won’t cause the entire file system to crash.

The second keyword in the title is Dynamic. This means that the metadata can actually be moved or redistributed from one MDS to another. If an MDS goes down or is added, portions of the file system directory hierarchy are moved to better balance performance and capacity. This distribution is based on the workload but preserves locality in each MDS’s workload, improving performance because the metadata can be aggressively prefetched.

Dynamic metadata also means that over time the metadata is redistributed to make better use of resources, including load balancing even for systems that don’t add storage hardware. So if a certain part of the directory tree is used more often than others, it can either be divided across MDS nodes or consolidated to a single MDS coupled with aggressive caching.
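A crude way to picture the redistribution is a greedy assignment of directory subtrees to metadata servers by load. This is only a sketch under assumed names; Ceph’s real MDS migrates subtrees incrementally and weighs locality, not just raw load.

```python
def rebalance_subtrees(subtree_load, mds_nodes):
    """Greedy sketch of dynamic metadata rebalancing: hand each directory
    subtree to the currently least-loaded MDS, hottest subtrees first.
    Illustrative only -- not Ceph's actual balancing algorithm.
    """
    load = {mds: 0 for mds in mds_nodes}
    assignment = {}
    # Place the hottest subtrees first so they spread across MDS nodes.
    for subtree, heat in sorted(subtree_load.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)
        assignment[subtree] = target
        load[target] += heat
    return assignment
```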

Reliability through Replication
Typical file systems, even distributed parallel ones, rely on data storage units that have RAID or SAN fail-over mechanisms to help maintain data access. This also includes redundant power supplies, possibly redundant RAID controllers, redundant network cards, and other costly hardware solutions. An example of this is Lustre. At the opposite end of this spectrum is Ceph, which uses replication to help maintain access to data. Ceph maintains copies of data across the OSD’s to ensure that the loss of any OSD, or multiple OSD’s, will not cause the loss of data. If an OSD is lost, the objects that it contained exist on other OSD’s and are immediately copied to the remaining OSD’s so that the proper number of copies is maintained. The copies are spread out so that no “hot spots” develop in the replication process and as much replication as possible takes place in parallel.
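The re-replication step can be sketched as follows. The data structures here are stand-ins (a dict mapping object ID to the set of OSD’s holding a copy), and the round-robin spreading is a simplification of how Ceph parallelizes recovery, but it shows the invariant being maintained: every object keeps its target number of copies, spread across the survivors.

```python
from itertools import cycle

def handle_osd_failure(placements, failed_osd, live_osds, copies=2):
    """Sketch of re-replication after an OSD is lost.

    `placements` maps object ID -> set of OSD ids holding a copy. Any
    object that lost a copy gets a new one, spread round-robin across
    the survivors so no single OSD becomes a replication hot spot.
    Illustrative only; real Ceph recovers in parallel via RADOS.
    """
    targets = cycle(sorted(live_osds))
    for holders in placements.values():
        holders.discard(failed_osd)
        while len(holders) < copies:
            candidate = next(targets)
            if candidate not in holders:
                holders.add(candidate)
    return placements
```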

Using replication does mean that you use more capacity to store the same data, but it also means that you don’t need parity disks or “spare” disks, making 100% use of all the storage in the OSD’s. It also means that you don’t develop hot spots in the OSD’s while waiting for a RAID rebuild. Moreover, since you don’t need to do a RAID rebuild, you don’t need the compute power for one, saving money and electrical power.

Distributed Object Storage
One way to achieve better performance is to stripe data across multiple OSD’s (something like RAID-0). Ceph does this and uses replication to ensure that the loss of an OSD does not mean that the data is lost. The component of Ceph that does this is called RADOS (Reliable Autonomic Distributed Object Store). Figure 2 below presents how the data from a file is broken into objects and distributed to the OSD’s.

Figure 2: Ceph Distributed Object Storage.

A file is broken into objects, and these objects are mapped into placement groups (PG’s) using a simple hash function. The placement groups are then assigned to OSD’s using a component of Ceph called CRUSH (Controlled Replication Under Scalable Hashing). CRUSH is a pseudo-random data distribution function that efficiently maps each PG to an ordered list of OSD’s where copies of the objects are stored. One feature of CRUSH is that it is a globally known function, so any component of Ceph (client, MDS, OSD) can compute the location of an object. This means that you don’t have to involve the MDS to compute the location of an object.
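The two-step mapping can be sketched like this. Note the placement function below is rendezvous hashing, a stand-in for the real CRUSH algorithm (which also understands weights and failure domains); the PG count and names are assumptions for illustration.

```python
import hashlib

NUM_PGS = 64  # real clusters use far more placement groups

def object_to_pg(object_id):
    """Step 1: map an object to a placement group with a simple hash."""
    digest = hashlib.md5(object_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PGS

def pg_to_osds(pg, osds, replicas=2):
    """Step 2: a toy stand-in for CRUSH -- a deterministic, pseudo-random
    mapping of a PG to an ordered list of OSD's. Because it depends only
    on the PG id and the OSD list, any client, MDS, or OSD can compute
    placement on its own: no central lookup table is needed.
    """
    ranked = sorted(osds,
                    key=lambda osd: hashlib.md5(f"{pg}:{osd}".encode()).digest())
    return ranked[:replicas]
```

The property worth noticing is that placement is computed, not stored: any two nodes evaluating the function with the same inputs get the same answer.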

Relaxation of POSIX (sort of)
Ceph uses the phrase “near-POSIX” because it has the ability to relax some of the POSIX semantics to improve performance (see the recent article POSIX IO Must Die!). In particular it uses a subset of a proposed set of extensions for POSIX for HPC (High-Performance Computing).

A classic example illustrating why extensions are needed for POSIX: when a file is opened by multiple clients (as often happens in HPC), with either multiple writers or a mix of readers and writers, the metadata server will revoke any read caching and write buffering capabilities to make sure that all clients access the data correctly. This forces the client IO to suddenly become synchronous, and performance drops tremendously, particularly for small writes (POSIX is at least enforcing consistency – always good). However, some applications already know that they don’t have consistency issues because of the design of the application – each writer or reader works on an independent part of the file, which is common in HPC applications – yet they have to suffer a severe performance penalty because POSIX has chosen to trust no one, even if the application is correct.

The proposed POSIX extensions have options to address this issue as well as others. In particular, there is an O_LAZY flag for the open() syscall that explicitly relaxes coherency for a shared-write file. It assumes that the application is managing its own coherency. As previously mentioned, in HPC many applications can read/write a single file from many processes since each process works on an independent part of the file. Using the O_LAZY flag means that these applications can run at higher speeds, using the caching and buffering that POSIX would normally disallow in this situation.
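In code, the intended usage would look something like the sketch below. Since O_LAZY is only a proposed extension, there is no real flag to import; the constant here is a placeholder (set to 0 so the sketch still runs with ordinary POSIX semantics), and `SLICE_BYTES` is an assumed per-process slice size.

```python
import os

# O_LAZY is a *proposed* POSIX HPC extension; there is no real flag to
# import. The value below is a hypothetical placeholder (0 here, so the
# sketch degrades to ordinary strict-coherency open semantics).
O_LAZY = 0
SLICE_BYTES = 1 << 20  # each process owns a 1 MiB slice of the shared file

def open_shared_output(path, rank):
    """Open one shared output file the way an HPC code might: each rank
    seeks to its own non-overlapping slice. With a real O_LAZY, the
    client could keep caching and buffering enabled because the
    application itself guarantees the writes never overlap.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | O_LAZY)
    os.lseek(fd, rank * SLICE_BYTES, os.SEEK_SET)
    return fd
```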


Ceph has a number of features which make it very attractive for the growing storage needs we are all experiencing. It is designed for scalability, reliability, and performance. At the same time it assumes that hardware will fail or be added, so its design can adapt to these situations. Ceph breaks the file system into two pieces: (1) metadata, and (2) data. This allows each piece to be designed in the most efficient manner to achieve these three goals of Ceph.

Ceph uses a dynamic distributed metadata server (MDS) that is not only clustered but also adapts to the changing workload. It will automatically distribute portions of the hierarchical directory tree to other MDS servers in the cluster to better load balance as the workload changes. In addition, if an MDS server is added, it will move portions of the metadata to that new box, again better distributing the load.

The concept of replication is used along with Object Storage Devices (OSD’s) so that all the space on all the drives is used (no parity drives, no spare drives). During the writing of an object to Ceph, it is automatically replicated to other OSD’s so that the loss of one or more OSD’s won’t result in the loss of data. If an OSD is lost, the objects are re-replicated so that the number of copies of the objects is maintained.

While the Ceph client was recently included in the 2.6.34 kernel (in an “rc”, or release candidate, version at the time of writing), it is still considered not ready for prime time. It also uses btrfs as the underlying storage mechanism for the OSD’s, and btrfs itself is still in development. But including the client in the kernel does three things. First, it gives a vote of confidence to Ceph. Second, since it’s in the kernel, it should get more “development eyes” examining the code. And third, it should get more testing.

If you’re feeling “experimental” or have an upcoming need for larger amounts of storage, then give Ceph a try. It’s really not a scary octopus about to eat your boat.
