Ceph: The Distributed File System Creature from the Object Lagoon

Did you ever see one of those terrible sci-fi movies involving a killer octopus? Ceph, while named after just such an animal, is not a creature about to eat an unlucky spring breaker, but a new parallel distributed file system. The client portion of Ceph just went into the 2.6.34 kernel, so let's learn a bit more about it.

The last two years have seen a large number of file systems added to the kernel, many of them maturing to the point where they are useful, reliable, and, in some cases, in production. In the run-up to the 2.6.34 kernel, Linus added the Ceph client. What is unique about Ceph is that it is a distributed parallel file system promising scalability and performance, something that NFS lacks.

High-level view of Ceph

One might ask about the origin of the name Ceph, since it is somewhat unusual. Ceph is short for Cephalopod, the class of molluscs to which the octopus belongs. So it's really short for octopus, sort of. If you want more detail, take a look at the Wikipedia article about cephalopods. Now that the name has been explained, let's look at the file system.

Ceph was started by Sage Weil for his PhD dissertation at the University of California, Santa Cruz, in the Storage Systems Research Center in the Jack Baskin School of Engineering. The lab is funded by the DOE/NNSA and involves Lawrence Livermore National Laboratory (LLNL), Los Alamos National Laboratory (LANL), and Sandia National Laboratories. He graduated in the fall of 2007 and has kept developing Ceph. As mentioned previously, his efforts have been rewarded with the integration of the Ceph client into the upcoming 2.6.34 kernel.

The design goals of Ceph are to create a POSIX file system (or close to POSIX) that is scalable, reliable, and has very good performance. To reach these goals Ceph has the following major features:

  • It is object-based
  • It decouples metadata and data (many parallel file systems do this as well)
  • It uses a dynamic distributed metadata approach

These three features and how they are implemented are at the core of Ceph (more on that in the next section).

However, probably the most fundamental assumption in the design of Ceph is that large-scale storage systems are dynamic and that failures are guaranteed. The first part of the assumption, that storage systems are dynamic, means that storage hardware is added and removed and that the workloads on the system are changing. The second part means that hardware failures are expected, so the file system needs to be adaptable and resilient.

More in Depth

With the general view of Ceph in mind, let’s dive down into some more details to understand how it’s implemented and what it means. Below in Figure 1 is an overview of the layout of Ceph.

Figure 1: System layout of Ceph.

There are client nodes (the happy smiling faces), a metadata cluster, and the object storage cluster where the data is stored. When a client wants to open a file, it contacts the metadata cluster, referred to as the MDS (MetaData Server), which is in fact a cluster of servers. The MDS returns information that tells the client its capabilities (what it can and cannot do), the file size, the striping information (the data is striped across multiple storage devices for performance reasons), and the file inode used by Ceph. Once this information is received, the client sends and receives data directly from the Object Storage Devices (OSDs) that make up the data storage cluster. During the data transactions the MDS is checked for any changes. If there are none, everything proceeds normally; if there are changes, the MDS notifies the client and the OSDs. Once everything is done and the close request is sent to the MDS and OSDs, the client updates the MDS with any final details, and the MDS marks the file as closed and updates the metadata.
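The open/write/close flow can be sketched as a toy model in Python. To be clear, every class, field, and method name here is invented for illustration; the real Ceph protocol is far richer than this, but the division of labor (metadata via the MDS, data directly to the OSDs) is the point.

```python
# Toy model of the Ceph open/write/close flow described above.
# All names and fields are illustrative, not the actual Ceph protocol.

class MDS:
    """Metadata cluster: knows about files, never touches the bytes."""
    def __init__(self):
        self.files = {}  # path -> metadata dict

    def open(self, path):
        # Return capabilities, size, striping info, and a (toy) inode.
        return self.files.setdefault(path, {
            "inode": abs(hash(path)) & 0xFFFF,      # toy inode number
            "size": 0,
            "stripe_unit": 4 << 20,                 # stripe in 4 MB units
            "caps": {"read", "write", "cache"},     # what the client may do
        })

    def close(self, path, new_size):
        # On close, the client reports final details back to the MDS.
        self.files[path]["size"] = new_size


class OSD:
    """Object storage device: stores opaque objects, no file structure."""
    def __init__(self):
        self.objects = {}

    def write(self, oid, data):
        self.objects[oid] = data

    def read(self, oid):
        return self.objects[oid]


# A client opens via the MDS, then moves data directly with the OSDs.
mds, osd = MDS(), OSD()
meta = mds.open("/data/run1")
payload = b"first stripe of data"
osd.write((meta["inode"], 0), payload)   # data path: no MDS involved
mds.close("/data/run1", new_size=len(payload))
```

Note that the data transfer itself never passes through the MDS; that separation is what lets the data path scale independently of the metadata path.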

Object-Based Storage
The system layout serves as a guide for further discussion of the details and features of Ceph. One of the first features that needs to be explained is the object-based approach of the file system. In an object-based file system, the data is broken into objects, each assigned an object ID number and a small amount of metadata, and then sent for storage on the Object Storage Devices (OSDs). The file system metadata for a file then consists of a list of object IDs that define all of the data, plus other information about the file (e.g., access and modify dates). Typically, the metadata does not record precisely where the data is located; it relies on the OSDs for the storage and retrieval of the actual data. The OSD takes care of the lower-level functions itself (a kind of "smart" hard drive, if you will). The file system interacts with the OSDs at a high level, requesting the object itself or information about the object, rather than asking for a range of inodes or blocks.
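A minimal sketch of this decomposition, assuming an invented object-ID scheme and a deliberately tiny object size (real object sizes are megabytes, not bytes):

```python
import hashlib

OBJECT_SIZE = 4  # bytes per object; tiny for illustration only

def file_to_objects(data):
    """Split file data into fixed-size objects, each with an object ID and a
    little per-object metadata; return (file_metadata, objects)."""
    objects = {}
    object_ids = []
    for i in range(0, len(data), OBJECT_SIZE):
        chunk = data[i:i + OBJECT_SIZE]
        # Toy ID scheme: hash of a per-chunk label (purely hypothetical).
        oid = hashlib.sha1(f"demo-{i}".encode()).hexdigest()[:8]
        objects[oid] = {"data": chunk, "meta": {"offset": i, "len": len(chunk)}}
        object_ids.append(oid)
    # The file's metadata is just the ordered ID list plus attributes;
    # it says nothing about *where* the objects physically live.
    file_meta = {"object_ids": object_ids, "size": len(data)}
    return file_meta, objects

meta, objs = file_to_objects(b"hello world!")
```

The key property is in the last comment: the file metadata names the objects but not their physical locations, which is exactly what lets the OSDs manage placement themselves.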

While there have been only experimental OSD drives, the typical way of creating an OSD is to use a middle layer of software between the object-based file system and the file system on the drive itself. In this approach the drive is just a regular hard drive such as those we currently use. The OSD middle layer typically converts an object request into a file system request on the underlying drive.

Initially Ceph used something called EBOFS (Extent and B-tree based Object File System), but support for it was dropped in mid-2009. It was replaced with btrfs, which promises performance as good as or better than EBOFS. In addition, btrfs has a few features that EBOFS does not. Namely,

  • Copy-on-write semantics for file data (who doesn’t like a COW?)
  • Well maintained and tested (it’s in the kernel and under heavy development)

In addition, according to the Ceph wiki,

"... To avoid reinventing the wheel, Ceph will use btrfs on individual storage nodes (OSDs) to store object data, and we will focus on adding any additional functionality needed to btrfs where it will hopefully benefit non-Ceph users as well. ..."

For example, there is a recent patch that adds some features to btrfs that help Ceph.

Distributed Metadata
Another key aspect of Ceph that distinguishes it from other file systems is that it uses something Sage terms "Dynamic Distributed Metadata Management." The first keyword is distributed, meaning multiple metadata servers, unlike Lustre, which has only one. Being distributed means that the loss of a metadata server (MDS) won't cause the entire file system to crash.

The second keyword in the title is dynamic. This means that the metadata can be moved or redistributed from one MDS to another. If an MDS goes down or is added, portions of the file system directory hierarchy are moved to better balance performance and capacity. This distribution is based on the workload but preserves locality in each MDS's workload, improving performance because the metadata can be aggressively prefetched.

Dynamic metadata also means that over time the metadata is redistributed to make better use of resources, including load balancing on systems where no storage hardware is being added or removed. So if a certain part of the directory tree is used more often than others, it can either be divided across MDS nodes or consolidated onto a single MDS coupled with aggressive caching.
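To make the idea concrete, here is a toy rebalancer (not Ceph's actual algorithm, which is workload-driven and far more sophisticated): directory subtrees move between MDS nodes as whole units, which is what preserves locality on each server.

```python
# Toy metadata rebalancing: assign directory subtrees to MDS nodes so
# each node carries a similar load.  All names here are invented.

def rebalance(subtree_load, mds_nodes):
    """Greedily assign subtrees (name -> load) to MDS nodes, heaviest
    subtree first, always onto the currently lightest node.  Whole
    subtrees move, preserving locality within each MDS."""
    assignment = {mds: [] for mds in mds_nodes}
    totals = {mds: 0 for mds in mds_nodes}
    for subtree, load in sorted(subtree_load.items(), key=lambda kv: -kv[1]):
        target = min(mds_nodes, key=lambda m: totals[m])
        assignment[target].append(subtree)
        totals[target] += load
    return assignment

load = {"/home": 50, "/scratch": 30, "/projects": 20, "/etc": 5}
plan = rebalance(load, ["mds0", "mds1"])
```

Running the same function again after an MDS is added or removed yields a new plan, which is the "dynamic" part of the scheme.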

Reliability through Replication
Typical file systems, even distributed parallel ones, rely on data storage units that have RAID or SAN fail-over mechanisms to help maintain data access. This also includes redundant power supplies, possibly redundant RAID controllers, redundant network cards, and other costly hardware solutions; Lustre is an example. At the opposite end of the spectrum, Ceph uses replication to maintain access to data. Ceph keeps copies of data across the OSDs to ensure that the loss of one or even several OSDs will not cause the loss of data. If an OSD is lost, the objects it contained exist on other OSDs and are immediately copied to the remaining OSDs so that the proper number of copies is maintained. The copies are spread out so that no "hot spots" develop in the replication process, and as much replication as possible takes place in parallel.
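The recovery behavior can be sketched as follows (a simplified model, not Ceph's implementation: real placement goes through CRUSH, shown later, rather than random choice):

```python
import random

REPLICAS = 3  # copies kept of every object

def place(object_ids, osds):
    """Assign each object REPLICAS copies on distinct OSDs."""
    return {oid: random.sample(sorted(osds), REPLICAS) for oid in object_ids}

def handle_osd_failure(placement, osds, failed):
    """When an OSD dies, make a fresh copy of each object it held from a
    surviving replica, so every object is back at REPLICAS copies."""
    osds.discard(failed)
    for oid, homes in placement.items():
        if failed in homes:
            homes.remove(failed)
            # Spread the new copies over the remaining OSDs (choosing
            # targets at random avoids funneling all rebuild traffic to
            # one fixed spare, i.e. avoids a "hot spot").
            candidates = [o for o in sorted(osds) if o not in homes]
            homes.append(random.choice(candidates))

random.seed(0)  # deterministic for the example
osds = {f"osd{i}" for i in range(6)}
placement = place(["obj-a", "obj-b", "obj-c"], osds)
handle_osd_failure(placement, osds, "osd2")
```

Because each object's surviving replicas sit on different OSDs, the re-copies originate from many sources at once, which is why recovery can run in parallel.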

Using replication does mean that you use more capacity to store the same data, but it also means that you don't need parity disks or "spare" disks, making 100% use of all the storage in the OSDs. It also means that you don't develop hot spots in the OSDs waiting for a RAID rebuild. Moreover, since you don't need to do a RAID rebuild, you don't need the compute power for one, saving money and electrical power.

Distributed Object Storage
One way to achieve better performance is to stripe data across multiple OSDs (something like RAID-0). Ceph does this and uses replication to ensure that the loss of an OSD does not mean that the data is lost. The component of Ceph that does this is called RADOS (Reliable Autonomic Distributed Object Store). Figure 2 below presents how the data from a file is broken into objects and distributed to the OSDs.

Figure 2: Ceph Distributed Object Storage.

A file is broken into objects, and these objects are mapped into placement groups (PGs) using a simple hash function. The placement groups are then assigned to OSDs using a component of Ceph called CRUSH (Controlled Replication Under Scalable Hashing). CRUSH is a pseudo-random data distribution function that efficiently maps each PG to an ordered list of OSDs where copies of the objects are stored. One feature of CRUSH is that it is a globally known function, so any component of Ceph (client, MDS, OSD) can compute the location of an object. This means that you don't have to involve the MDS to compute the location of an object.
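The two-step mapping can be sketched like this. Note the caveat in the comments: real CRUSH walks a weighted hierarchy of the cluster map; this toy version only captures the property that matters here, namely that the function is pure and globally known, so every party computes the same answer without asking the MDS.

```python
import hashlib

NUM_PGS = 64   # placement groups in this toy cluster
REPLICAS = 2   # OSDs per placement group

def pg_for_object(oid):
    """Step 1: a simple hash maps an object ID to a placement group."""
    return int(hashlib.sha1(oid.encode()).hexdigest(), 16) % NUM_PGS

def osds_for_pg(pg, osds):
    """Step 2: a CRUSH-like deterministic pseudo-random function maps the
    PG to an ordered list of OSDs.  (Real CRUSH descends a weighted
    hierarchy; this sketch just ranks OSDs by a per-(pg, osd) hash.)
    Because the function is pure, any component -- client, MDS, or OSD --
    computes the same answer with no lookup table and no MDS round-trip."""
    ranked = sorted(osds,
                    key=lambda o: hashlib.sha1(f"{pg}:{o}".encode()).hexdigest())
    return ranked[:REPLICAS]

osds = [f"osd{i}" for i in range(8)]
pg = pg_for_object("inode123.0000")
homes = osds_for_pg(pg, osds)
```

A client holding only the object ID and the list of OSDs can thus locate (and replicate) data entirely on its own.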

Relaxation of POSIX (sort of)
Ceph uses the phrase “near-POSIX” because it has the ability to relax some of the POSIX semantics to improve performance (see the recent article POSIX IO Must Die!). In particular it uses a subset of a proposed set of extensions for POSIX for HPC (High-Performance Computing).

A classic example illustrating why extensions to POSIX are needed: when a file is opened by multiple clients (as often happens in HPC), either with multiple writers or with a mix of readers and writers, the metadata server will revoke any read caching and write buffering capabilities to make sure that all clients access the data correctly. This forces the client IO to become synchronous, and performance drops tremendously, particularly for small operations. POSIX is at least enforcing consistency, which is always good. However, some applications already know they have no consistency issues because of their design (common in HPC, where each writer or reader works on an independent part of the file), yet they still suffer a severe performance penalty because POSIX trusts no one.

The proposed POSIX extensions have options to address this issue, among others. In particular, there is an O_LAZY flag for the open() syscall that explicitly relaxes coherency for a shared-write file. It assumes that the application is managing its own coherency. As previously mentioned, in HPC many applications can read from and write to a single file from many processes because each process works on an independent part of the file. Using the O_LAZY option means that such applications can run at higher speed, using the caching and buffering that strict POSIX semantics would otherwise force the file system to disable.
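As a sketch of how an application might use it: O_LAZY is only a proposed extension, so no mainstream libc or Python build defines it; the code below probes for the flag and falls back to a plain open when it is absent.

```python
import os
import tempfile

# O_LAZY is a *proposed* HPC extension to POSIX; no mainstream system
# defines it today, so probe for it and fall back to a no-op (0).
O_LAZY = getattr(os, "O_LAZY", 0)  # hypothetical flag

def open_shared_output(path):
    # Each HPC rank writes its own independent region of the shared file,
    # so the application manages coherency itself and, where supported,
    # asks the file system not to enforce it.
    return os.open(path, os.O_WRONLY | os.O_CREAT | O_LAZY, 0o644)

path = os.path.join(tempfile.gettempdir(), "shared.dat")
fd = open_shared_output(path)
os.pwrite(fd, b"rank-0 data", 0)  # positioned write at this rank's offset
os.close(fd)
```

On a file system that honored O_LAZY, each rank's writes could stay in its local cache until an explicit flush, instead of every write being forced synchronous.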


Summary

Ceph has a number of features that make it very attractive for the growing storage needs we are all experiencing. It is designed for scalability, reliability, and performance. At the same time it assumes that hardware will fail and that hardware will be added, so its design can adapt to these situations. Ceph breaks the file system into two pieces: (1) metadata and (2) data. This allows each piece to be designed in the most efficient manner to achieve the three goals of Ceph.

Ceph uses a dynamic distributed metadata server (MDS) that is not only clustered but also adapts to the changing workload. It will automatically distribute portions of the hierarchical directory tree to other MDS servers in the cluster to better balance the load as the workload changes. In addition, if an MDS server is added, Ceph will move portions of the metadata to that new box, again better distributing the load.

The concept of replication is used along with the Object Storage Devices (OSDs) so that all the space on all the drives is used (no parity drives, no spare drives). When an object is written to Ceph, it is automatically replicated to other OSDs so that the loss of one or more OSDs won't result in the loss of data. If an OSD is lost, its objects are re-replicated so that the number of copies is maintained.

While the Ceph client was recently included in the 2.6.34 kernel (at the time of writing, still a release candidate), it is considered not yet ready for prime time. It also uses btrfs as the underlying storage mechanism for the OSDs, and btrfs itself is still in development. But including the client in the kernel does three things. First, it gives a vote of confidence to Ceph. Second, since it's in the kernel, more "development eyes" should examine the code. And third, it should get more testing.

If you’re feeling “experimental” or have an upcoming need for larger amounts of storage, then give Ceph a try. It’s really not a scary octopus about to eat your boat.

Comments on "Ceph: The Distributed File System Creature from the Object Lagoon"


Sounds really good, will try it on my new project.

Can someone please help understanding what exactly BUCKET means in context to ceph & what is the physical significance of cluster map ..

Thanx in advance !
