
OCFS2: Unappreciated Linux File System

It's common knowledge that Linux has a fair number of file systems. Some of these are unappreciated and can be very useful outside their "comfort zone". OCFS2 is a clustered file system initially contributed by Oracle and can be a great back-end file system for general, shared storage needs.

Lately I’ve been talking about NAS servers, focusing on NFS. One of the difficult aspects of a NAS server is how to properly expand it in terms of capacity and performance. In many situations you are limited to the capacity provided by the server and its storage, which depends upon the hardware. In almost all situations you are also limited to a single NFS gateway because the underlying file system is local to the server (Note: there are exceptions but they are proprietary, and many times that means more money). Wouldn’t it be nice to scale the storage as needed and share it across multiple NFS gateways (or even CIFS gateways)? One approach to this is a clustered file system.

A clustered file system is one that is mounted on multiple servers at the same time. This involves some sort of shared disk (shared storage) system, typically a SAN or an external RAID array, that gives the servers direct disk access at the block level. One could easily imagine problems with clustered file systems where multiple servers are trying to access the same file or part of a file at the same time. To prevent these problems, the file system typically provides a mechanism for coordinating concurrent access to a file or part of a file. Note that this mechanism is absent from the more conventional local file systems. Consequently, clustered file systems can be more complicated than local file systems.

The servers reach the shared storage over a block-level protocol. The typical ones are SCSI (typically used when sharing an external RAID array), iSCSI, ATA over Ethernet, InfiniBand, and Fibre Channel (FC). Linux supports all of these protocols, but some hardware devices may not be supported by Linux. A very easy example is to use OpenFiler, which supports iSCSI, to provide storage for a set of servers running a clustered file system.

One such clustered file system in Linux is OCFS2 (Oracle Cluster File System 2). The file system was developed from OCFS, which was totally focused on database storage needs. OCFS2 added POSIX compliance and has other features that can be very useful for shared storage requirements outside of databases. Let’s take a quick look at OCFS2 by starting with OCFS.

OCFS (Oracle Cluster File System)

Oracle developed OCFS as a shared disk file system for use with the database files of Oracle’s clustered database. This means that it had a limited, albeit very useful, focus and lacked POSIX compatibility (if that is important to you). It was released under the GNU General Public License (GPL). It was never part of the kernel, but it was used by many people running Oracle on Linux. It has since been overtaken by OCFS2.

OCFS2

OCFS was overtaken by OCFS2, which added POSIX compatibility, broadening its appeal. One of the achievements of OCFS2, besides being POSIX compatible, was that it was integrated into the 2.6.16 kernel. The experimental label was removed from OCFS2 in the 2.6.19 version of the kernel.

The actual file system component of OCFS2 is inspired by ext3 but deviates from ext3 in its implementation. For example, it uses extents, as ext4 does. In addition, rather than use its own journaling subsystem, it initially used the Linux JBD (Journaling Block Device) layer. JBD uses 32-bit block numbers, limiting OCFS2 to a file system size of 2^32 * blocksize. With a 4KB block size this means that OCFS2 was limited to 16TB.
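The 16TB figure follows directly from the arithmetic above. Here is a minimal sketch of that calculation; the function name is made up for illustration, and the block sizes are the ones OCFS2 offers (512 bytes through 4KB).

```python
def max_fs_bytes(block_size_bytes: int, block_number_bits: int = 32) -> int:
    """Largest file system addressable with fixed-width block numbers."""
    return (2 ** block_number_bits) * block_size_bytes


TIB = 2 ** 40  # the article's "TB" is treated as a binary terabyte here

for block_size in (512, 1024, 2048, 4096):
    limit = max_fs_bytes(block_size)
    print(f"{block_size:>4} byte blocks -> {limit // TIB} TB")
```

With 4KB blocks the result is 2^32 * 2^12 = 2^44 bytes, which is the 16TB limit mentioned above.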

In the 2.6.28 kernel OCFS2 switched over to the JBD2 journaling layer, which fixed the 32-bit limitation of JBD. As of the writing of this article, support for file systems greater than 16TB still isn’t fully activated, according to Sunil Mushran, an OCFS2 developer. The file system is theoretically capable of more than 16TB (the theoretical limit is about 4PB), but according to Sunil it hasn’t had sustained testing beyond 16TB. However, he believes that support for file systems beyond 16TB will happen sometime this year.

One of the keys to OCFS2 is the lock manager. A distributed lock manager plays traffic cop when various processes are accessing the same file or data range. In essence the lock manager gives processes read-only or write access to various parts of files. This prevents a process from overwriting data that it should not. The lock manager in OCFS2 is distributed, since OCFS2 itself is distributed, so that if a node goes down the entire file system doesn’t go down. Generically the lock manager is called a DLM (Distributed Lock Manager), although there are several DLMs, including the one in OCFS2.

Making a lock manager work in a distributed environment may sound easy, but it definitely is not. One of the key reasons is that it has to work correctly over the network that connects the nodes, tolerating whatever latency the network imposes. Moreover, it must keep working correctly as nodes are added to or removed from the group, and it must adapt when a node that holds a lock goes down (and perhaps comes back up). It is a non-trivial problem, but it holds the key to clustered file systems.
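To make the traffic cop idea concrete, here is a toy, single-process sketch of the rule a lock manager enforces on byte ranges: readers can share a range, while a writer needs it to itself. This is not how the OCFS2 DLM is implemented. The class and method names are invented for illustration, and all of the hard distributed parts (network messaging, lock mastering, recovery) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeLock:
    owner: str   # name of the node holding the lock
    start: int   # first byte covered (inclusive)
    end: int     # one past the last byte covered (exclusive)
    mode: str    # "read" or "write"


class ToyLockTable:
    """In-memory table of granted byte-range locks for a single file."""

    def __init__(self):
        self._granted = []

    def try_lock(self, owner, start, end, mode):
        """Grant the lock and return True, or return False on a conflict."""
        if mode not in ("read", "write"):
            raise ValueError("mode must be 'read' or 'write'")
        for held in self._granted:
            overlaps = start < held.end and held.start < end
            # Two readers are compatible; anything involving a writer is not.
            conflicts = mode == "write" or held.mode == "write"
            if overlaps and conflicts and held.owner != owner:
                return False
        self._granted.append(RangeLock(owner, start, end, mode))
        return True

    def release_owner(self, owner):
        """Drop every lock held by a node, e.g. after it is declared dead."""
        self._granted = [l for l in self._granted if l.owner != owner]


table = ToyLockTable()
print(table.try_lock("node1", 0, 4096, "read"))      # readers may share
print(table.try_lock("node2", 0, 4096, "read"))
print(table.try_lock("node3", 1024, 2048, "write"))  # overlaps the readers
print(table.try_lock("node3", 4096, 8192, "write"))  # different range
```

The release_owner method hints at the recovery problem described above: when a node disappears, something has to decide that its locks are gone before anyone else can make progress. In a real cluster that decision is the difficult part.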

OCFS2 has a wide range of additional features listed here.


  • OCFS2 is endian neutral, allowing little endian (e.g. x86 and x86_64) and big endian (e.g. PowerPC) architectures to coexist, as well as 32-bit and 64-bit mounts
  • It supports direct IO, asynchronous IO, buffered IO, splice IO (added in version 1.4), and memory mapped IO
  • It supports large inodes which are good for small files because the files can be stored in the inode itself (better performance)
  • It tracks modify time (mtime), access time (atime), and attribute modification time (ctime). It can also use relative atime (relative access time).
  • It supports ordered and writeback journal modes
  • It supports Unix style permissions as well as ACLs
  • In the 2.6.29 kernel it added metadata checksums (reduces metadata corruption possibilities)
  • It has variable block sizes (512 Bytes, 1KB, 2KB, and 4KB) with 4KB being the recommended size
  • It has a variable cluster size where the cluster size is the unit of space allocated for file data. The options in version 1.4 of OCFS2 are 4KB, 8KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB, and 1MB. The recommended size is 4KB.
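Since the cluster size is the unit of allocation for file data, it determines how much space a file actually occupies. The sketch below rounds a file size up to whole clusters. It is a simplification that ignores metadata and the small-file-in-inode case from the list above, and the function name is made up for illustration.

```python
def allocated_bytes(file_size: int, cluster_size: int) -> int:
    """Space consumed by file data when allocation happens in whole clusters."""
    if file_size < 0 or cluster_size <= 0:
        raise ValueError("file_size must be >= 0 and cluster_size > 0")
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size


KB = 1024
file_size = 10_000  # bytes
for cluster_size in (4 * KB, 64 * KB, 1024 * KB):
    used = allocated_bytes(file_size, cluster_size)
    print(f"{cluster_size // KB:>4}KB clusters: {used} bytes allocated")
```

A 10,000-byte file takes three 4KB clusters (12,288 bytes) but a full 1MB when the cluster size is 1MB, which is one reason a small cluster size makes sense for general-purpose storage with many small files.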

There is also a set of user tools for OCFS2. There are two packages: ocfs2-tools, which is a command line interface, and ocfs2console, which is a GUI. The documentation suggests using the GUI to configure a clustered file system, but many of the HOWTOs do it by hand (the configuration file is quite easy to read).
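To give a feel for how readable the configuration file is, here is a small sketch that renders a cluster definition in the stanza style used by the OCFS2 cluster configuration file (usually /etc/ocfs2/cluster.conf). Treat the layout as an assumption to check against the documentation for your ocfs2-tools version: the key names, the tab indentation, and the port 7777 default are written from memory, and the node names and addresses are invented.

```python
def render_cluster_conf(cluster_name, nodes, port=7777):
    """Render node and cluster stanzas for a list of (hostname, ip) pairs.

    Assumed layout: a stanza header in the first column, followed by
    tab-indented "key = value" lines, with a blank line between stanzas.
    """
    lines = []
    for number, (name, ip_address) in enumerate(nodes):
        lines += [
            "node:",
            f"\tip_port = {port}",
            f"\tip_address = {ip_address}",
            f"\tnumber = {number}",
            f"\tname = {name}",
            f"\tcluster = {cluster_name}",
            "",
        ]
    lines += [
        "cluster:",
        f"\tnode_count = {len(nodes)}",
        f"\tname = {cluster_name}",
        "",
    ]
    return "\n".join(lines)


# Hypothetical two-node cluster; names and addresses are placeholders.
print(render_cluster_conf("ocfs2", [("node1", "192.168.1.101"),
                                    ("node2", "192.168.1.102")]))
```

Every node in the cluster needs the same copy of this file, which is why generating it from one list of nodes can be less error-prone than editing it by hand on each machine.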

While OCFS2 is a clustered file system, the volumes can be mounted as clustered volumes or as local (single-node) volumes. If you need more performance, you may need to add more nodes to the OCFS2 cluster. Adding a node to an existing system is fairly easy. According to Sunil, “Adding nodes to a cluster is a straightforward process. You don’t have to unmount to add. Just leave it mounted on the node(s). If the new node is already registered, then just mount the volume on the new node. Mounting will kick-start the heartbeat, which will trigger the other nodes to connect to the new node. Once connected, the new node can then join the DLM domain, etc. All this handled transparently by the mount command.” The OCFS2 guide has instructions on how to do this, including for nodes that are not registered.

When/Where You Can Use OCFS2 Outside of Databases

The title of the article is “OCFS2: Unappreciated Linux File System” because while OCFS2 is arguably the best file system for Oracle, its POSIX compatibility and flexibility make it useful for purposes other than databases.

The most obvious use case is to create a clustered file system between two nodes. You simply take two nodes that share some storage (perhaps FC or iSCSI or SCSI) and share a common TCP network, and you can configure OCFS2 across the two nodes. You can share data between the two nodes via OCFS2, but only between those two nodes. You can grow the cluster and add more nodes, but you are still limited to sharing data only between the OCFS2 nodes. Despite our desires, OCFS2 does have scaling limitations that depend on our performance requirements, the bandwidth and latency of the network connecting the nodes, the bandwidth and latency of the storage subsystem, and the CPU and memory of each node. So in practice the number of nodes is driven by what kind of performance we desire from the filesystem/storage combination.

To share data with more systems we can easily use the OCFS2 nodes as NFS gateways for other clients. In this case each node of the OCFS2 file system should be fairly beefy (i.e. more than 2 cores) because it functions not only as an IO node (i.e. part of the OCFS2 file system) but also as an NFS gateway. If you have more than one node in OCFS2 then you can use each one as an NFS gateway, dividing the clients equally between the nodes. This helps reduce the load on any particular gateway but imposes additional work on the administrator, who typically has to divide the clients among the gateways by hand. You can also use round-robin DNS to reduce this administrative burden, at the cost of additional work in the initial configuration of the DNS.
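Dividing clients equally among the gateways by hand amounts to a round-robin assignment. Here is a minimal sketch of that bookkeeping; the host names are invented, and in practice the result would end up in each client's mount configuration rather than in a Python dictionary.

```python
from itertools import cycle


def assign_clients(clients, gateways):
    """Spread NFS clients across gateways in round-robin order."""
    if not gateways:
        raise ValueError("at least one gateway is required")
    assignment = {gateway: [] for gateway in gateways}
    for client, gateway in zip(clients, cycle(gateways)):
        assignment[gateway].append(client)
    return assignment


# Hypothetical hosts: five NFS clients and two OCFS2 nodes acting as gateways.
clients = ["client1", "client2", "client3", "client4", "client5"]
gateways = ["ocfs2-node1", "ocfs2-node2"]
for gateway, assigned in assign_clients(clients, gateways).items():
    print(gateway, "->", ", ".join(assigned))
```

This balances the number of clients per gateway, not the load. If a few clients generate most of the traffic, you would still want to place those by hand.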

An advantage of using OCFS2 as the back end file system for NFS is that if you need more capacity you just add more storage. If you need more overall performance you can add another node to the OCFS2 cluster, which can then be an NFS gateway. While this doesn’t improve the performance of an individual client, it increases the overall aggregate performance.

Summary

OCFS2 is a very well-known file system in the world of databases, particularly for Oracle. But it is often underappreciated by the general Linux storage world, which perhaps does not realize that it is POSIX compliant. It has a number of great features as a single-node file system, such as extents and metadata checksums, but perhaps more importantly it is a clustered file system. This means that some sort of shared storage must be used, such as a SAN or external RAID arrays.

For general Linux storage functions you can use OCFS2 as the back end storage for NFS. This allows you to easily grow the capacity by adding storage to the SAN. Each OCFS2 node can act as an NFS server (gateway) with its own set of clients. This allows you to load-balance NFS traffic between the OCFS2 nodes, although that’s really something of a manual process (i.e. you have to balance the clients using more storage across the OCFS2 nodes). And just in case, you should be able to install Samba and use OCFS2 as the back end storage for CIFS traffic (I don’t know if this is possible but I haven’t seen any roadblocks). If you need commercial support for it, Oracle provides that as well.

Give OCFS2 a chance. Take a look, try it on some test machines, and see how it satisfies your requirements. I think you will be surprised.
