
OCFS2: Unappreciated Linux File System

It's common knowledge that Linux has a fair number of file systems. Some of these are unappreciated and can be very useful outside their "comfort zone". OCFS2 is a clustered file system, originally contributed by Oracle, that can be a great back-end file system for general, shared storage needs.

Lately I’ve been talking about NAS servers, focusing on NFS. One of the difficult aspects of a NAS server is how to properly expand it in terms of capacity and performance. In many situations you are limited to the capacity provided by the server/storage hardware, and in almost all situations you are limited to a single NFS gateway because the underlying file system is local to the server (note: there are exceptions, but they are proprietary, and many times that means more money). Wouldn’t it be nice to scale the storage as needed and share it across multiple NFS gateways (or even CIFS gateways)? One approach to this is a clustered file system.

A clustered file system is one that is mounted on multiple servers at the same time. This involves some sort of shared-disk (shared storage) system, typically a SAN or an external RAID array, that gives the servers direct access to the disks at the block level. One can easily imagine the problems that arise when multiple servers try to access the same file, or part of a file, at the same time. To prevent this, the file system provides a mechanism for coordinating concurrent access to files. Note that this mechanism is absent from more conventional local file systems. Consequently, clustered file systems are more complicated than local file systems.

Access to the storage is provided to the servers by a block-level protocol. The typical ones are SCSI (commonly used when sharing an external RAID array), iSCSI, ATA over Ethernet, InfiniBand, and Fibre Channel (FC). Linux supports all of these protocols, but some hardware devices may not be supported by Linux. A very easy example is to use OpenFiler, which supports iSCSI, to provide storage for a set of servers running a clustered file system.

One such clustered file system for Linux is OCFS2 (Oracle Cluster File System, version 2). It was developed from OCFS, which was totally focused on database storage needs. OCFS2 added POSIX compliance and other features that can be very useful for shared storage requirements outside of databases. Let’s take a quick look at OCFS2 by starting with OCFS.

OCFS (Oracle Clustered File System)

Oracle developed OCFS as a shared-disk file system for use with the database files from Oracle’s clustered database. This means it had a limited, albeit very useful, focus and lacked POSIX compatibility (if that is important to you). It was released under the GNU General Public License (GPL) but was never part of the mainline kernel, although it was used by many people running Oracle on Linux. It has since been superseded by OCFS2.

OCFS2

OCFS was superseded by OCFS2, which added POSIX compatibility, broadening its appeal. One of the achievements of OCFS2, besides being POSIX compatible, was that it was merged into the 2.6.16 kernel. The experimental label was removed from OCFS2 in the 2.6.19 version of the kernel.

The actual file system component of OCFS2 is inspired by ext3, but it deviates from ext3 in its implementation. For example, it uses the concept of extents, which ext4 also uses. In addition, rather than write its own journaling subsystem, it initially used the Linux JBD (Journaling Block Device) layer. JBD uses 32-bit block numbers, limiting OCFS2 to a file system size of 2^32 * blocksize. With a 4KB block size, this means OCFS2 was limited to 16TB.
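
The arithmetic behind that limit is easy to check. Here is a quick sketch in plain Python (nothing OCFS2-specific, just the 2^32 * blocksize computation from above):

    # Maximum file system size with a 32-bit journal layer:
    # 2^32 addressable blocks times the block size.
    for block_size in (512, 1024, 2048, 4096):  # bytes
        max_bytes = 2**32 * block_size
        print(f"{block_size:>5} B blocks -> {max_bytes / 2**40:.0f} TB")

    # With 4KB blocks: 2^32 * 4096 bytes = 16 TB, the limit quoted above.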

In the 2.6.28 kernel, OCFS2 switched over to the JBD2 journaling layer, which fixed the 32-bit limitation of JBD. As of the writing of this article, support for file systems greater than 16TB still isn’t fully activated, according to Sunil Mushran, an OCFS2 developer. The file system is theoretically capable of much more (the theoretical limit is about 4PB), but according to Sunil it hasn’t had sustained testing beyond 16TB. However, he believes that support for file systems beyond 16TB will happen sometime this year.

One of the keys to OCFS2 is the lock manager. A lock manager plays traffic cop when various processes are accessing the same file or data range. In essence, the lock manager gives processes read-only or write access to various parts of files, which prevents one process from overwriting data it should not. The lock manager in OCFS2 is distributed, as OCFS2 itself is, so that if a node goes down the entire file system doesn’t go down. Generically this kind of lock manager is called a DLM (Distributed Lock Manager), and there are several DLMs, including the one in OCFS2.

Making a lock manager work in a distributed environment may sound easy, but it definitely is not. One of the key reasons is that it has to work correctly over the network that connects the nodes, tolerating whatever latency the network imposes. Moreover, it must work correctly as nodes are added to, or removed from, the group, and it must adapt in the event that a node holding a lock goes down (and perhaps comes back up). It is a non-trivial problem, but it holds the key to clustered file systems.
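
To make the idea concrete, here is a deliberately simplified, single-machine sketch of the shared/exclusive lock semantics a lock manager enforces. This is illustrative only; it is not OCFS2’s actual DLM protocol, which additionally handles network messaging, node membership, and lock recovery:

    # Toy illustration of shared (read) vs. exclusive (write) lock grants.
    # A real DLM does this across a network and survives node failures.

    class ToyLockManager:
        def __init__(self):
            # resource name -> {"mode": "shared"|"exclusive", "holders": set()}
            self.locks = {}

        def request(self, node, resource, mode):
            """Grant the lock if compatible; refuse otherwise (a real DLM queues)."""
            lock = self.locks.get(resource)
            if lock is None:
                self.locks[resource] = {"mode": mode, "holders": {node}}
                return True
            if mode == "shared" and lock["mode"] == "shared":
                lock["holders"].add(node)   # many readers may coexist
                return True
            return False                    # a writer needs exclusive access

        def release(self, node, resource):
            lock = self.locks.get(resource)
            if lock and node in lock["holders"]:
                lock["holders"].discard(node)
                if not lock["holders"]:
                    del self.locks[resource]

    dlm = ToyLockManager()
    assert dlm.request("node1", "fileA", "shared")         # first reader gets in
    assert dlm.request("node2", "fileA", "shared")         # second reader OK
    assert not dlm.request("node3", "fileA", "exclusive")  # writer must wait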

OCFS2 has a wide range of additional features, listed below.


  • OCFS2 is endian neutral, allowing little-endian (e.g. x86, x86_64) and big-endian (e.g. PowerPC) architectures to coexist in the same cluster, as well as 32-bit and 64-bit mounts
  • It supports direct IO, asynchronous IO, buffered IO, splice IO (added in version 1.4), and memory mapped IO
  • It supports large inodes, which are good for small files because a small file can be stored in the inode itself (better performance)
  • It tracks modification time (mtime), access time (atime), and attribute change time (ctime). It can also use relative atime (relatime).
  • It supports ordered and writeback journal modes
  • It supports Unix-style permissions as well as ACLs
  • In the 2.6.29 kernel it added metadata checksums (which help detect metadata corruption)
  • It has variable block sizes (512 Bytes, 1KB, 2KB, and 4KB) with 4KB being the recommended size
  • It has a variable cluster size, where the cluster size is the unit of space allocated for file data. The options in version 1.4 of OCFS2 are 4KB, 8KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB, and 1MB. The recommended size is 4KB (a quick sketch of the cluster-size trade-off follows this list).
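
As promised above, here is the cluster-size trade-off in miniature. Since the cluster is the allocation unit for file data, every non-empty file consumes at least one whole cluster, so large clusters waste space on small files while small clusters mean more allocation work for large files. A quick sketch in plain Python (the 5KB file size is just an example):

    # Slack space: bytes allocated beyond what the file actually uses.
    def slack(file_size, cluster_size):
        clusters = -(-file_size // cluster_size)   # ceiling division
        return clusters * cluster_size - file_size

    for cluster_size in (4096, 65536, 1048576):    # 4KB, 64KB, 1MB
        wasted = slack(5120, cluster_size)         # a 5KB file
        print(f"{cluster_size:>8} B clusters waste {wasted} B on a 5KB file")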

There is also a set of user tools for OCFS2, in two packages: ocfs2-tools, which provides the command-line utilities, and ocfs2console, which is a GUI. The documentation suggests using the GUI to configure a clustered file system, but many of the HOWTOs do it by hand (the configuration file is quite easy to read).
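
For reference, /etc/ocfs2/cluster.conf looks roughly like the following two-node example (the node names, IP addresses, and cluster name shown here are placeholders; adjust them for your own cluster):

    cluster:
            node_count = 2
            name = ocfs2

    node:
            ip_port = 7777
            ip_address = 192.168.1.101
            number = 0
            name = node1
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 192.168.1.102
            number = 1
            name = node2
            cluster = ocfs2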

While OCFS2 is a clustered file system, volumes can be mounted as clustered volumes or as local (single-node) volumes. If you need more performance, you may need to add more nodes to the OCFS2 cluster. Adding a node to an existing system is fairly easy. According to Sunil, “Adding nodes to a cluster is a straightforward process. You don’t have to unmount to add. Just leave it mounted on the node(s). If the new node is already registered, then just mount the volume on the new node. Mounting will kick-start the heartbeat, which will trigger the other nodes to connect to the new node. Once connected, the new node can then join the DLM domain, etc. All this handled transparently by the mount command.” The OCFS2 user’s guide has instructions on how to do this, including for nodes that are not registered.

When/Where You Can Use OCFS2 Outside of Databases

The title of this article calls OCFS2 an unappreciated Linux file system because, while OCFS2 is arguably the best file system for Oracle, its POSIX compatibility and flexibility make it useful for purposes other than databases.

The most obvious use case is to create a clustered file system between two nodes. You simply take two nodes that share some storage (perhaps FC, iSCSI, or SCSI) and a common TCP network, and you configure OCFS2 across the two nodes. You can then share data between the two nodes via OCFS2, but only between those nodes. You can grow the cluster and add more nodes, but you are still limited to sharing data only among the OCFS2 nodes. And despite our desires, OCFS2 does have scaling limits that depend on the performance requirements, the bandwidth and latency of the network connecting the nodes, the bandwidth and latency of the storage subsystem, and the CPU and memory of each node. So the practical number of nodes is driven by what kind of performance you want from the file system/storage combination.

To share data with more systems, you can easily use the OCFS2 nodes as NFS gateways for other clients. In this case each node of the OCFS2 file system should be fairly beefy (i.e. more than two cores) because it not only functions as an IO node (i.e. part of the OCFS2 file system) but also as an NFS gateway. If you have more than one node in the OCFS2 cluster, you can use each one as an NFS gateway, equally dividing the clients among the nodes. This reduces the load on any particular gateway but imposes additional work on the administrator, who typically has to divide the clients among the gateways by hand. You can also use DNS round-robin to reduce this administrative burden, at the cost of additional work in the initial configuration of the DNS.
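
If you would rather not assign clients by hand or touch the DNS, a static client-to-gateway mapping can be generated instead. Here is a minimal sketch that hashes each client’s hostname to pick a gateway; the gateway names and client list are hypothetical:

    import hashlib

    # Hypothetical OCFS2 nodes acting as NFS gateways.
    GATEWAYS = ["nfs-gw0", "nfs-gw1", "nfs-gw2"]

    def gateway_for(client):
        # Stable hash: a client always maps to the same gateway,
        # and clients spread roughly evenly across the gateways.
        digest = hashlib.md5(client.encode()).hexdigest()
        return GATEWAYS[int(digest, 16) % len(GATEWAYS)]

    for client in ["ws01", "ws02", "ws03", "ws04"]:
        print(client, "->", gateway_for(client))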

An advantage of using OCFS2 as the back-end file system for NFS is that if you need more capacity, you just add more storage. If you need more overall performance, you can add another node to the OCFS2 cluster, which can then be an NFS gateway. While this doesn’t improve individual client performance, it increases the overall aggregate performance.

Summary

OCFS2 is a very well-known file system in the world of databases, particularly for Oracle, but it is often underappreciated by the general Linux storage world, perhaps because people don’t realize it is POSIX compliant. It has a number of great features as a single-node file system, such as extents and metadata checksums, but perhaps more importantly, it is a clustered file system. This means that some sort of shared storage must be used, such as a SAN or an external RAID array.

For general Linux storage functions you can use OCFS2 as the back-end storage for NFS. This allows you to easily grow capacity by adding storage to the SAN. Each OCFS2 node can act as an NFS server (gateway) with its own set of clients. This lets you load-balance NFS traffic across the OCFS2 nodes, although that’s really something of a manual process (i.e. you have to balance the clients across the OCFS2 nodes yourself). And just in case: you should also be able to install Samba and use OCFS2 as back-end storage for CIFS traffic (I haven’t tried this, but I haven’t seen any roadblocks). If you need commercial support, Oracle provides that as well.

Give OCFS2 a chance. Take a look, try it on some test machines, and see how it satisfies your requirements. I think you will be surprised.

Comments on "OCFS2: Unappreciated Linux File System"

txster

One thing I’ve seen OCFS2 be bad at is dealing with tons of small files that change a lot. A client set up OCFS2 on two nodes to store the company’s mail system, storing data in a Maildir structure. After 3-4 months (and this is a rather small, <50 people, place) it just stops working! After troubleshooting we found out that OCFS2 couldn’t allocate its own ‘inodes’ anymore. Usually it auto-extends, but it needs 1MB of contiguous space to do it… Apparently the fs was so fragmented that this was impossible! A bit annoying, to say the least.

zapman449

OCFS2 v1.4.x seems to work great for large files, and for small files that don’t change much. But if you’re in a small-file, high-turnover situation, there are some nasty fragmentation bugs out there. You need to be ready to track the latest Linux kernels, and also be prepared to rebuild your filesystems.

jtmcdole

There wasn’t any mention of distributed parity or other forms of data protection if a node goes down. Does OCFS2 support this?

elielmsouza

Well, we use OCFS2 with 5 nodes. We receive 200,000 files per day, each about 8KB in size. Our biggest problem was fragmentation, but pre-creating inodes to fill the inode table works very well.
We implemented OCFS2 two years ago.
Excuse my poor English….

alcachi

One of the great advantages of OCFS2 is that it supports mounting loopback images. Several years ago we tried to set up GFS and ran into this issue because it did not support this feature. Basically this meant that we could not run Xen virtual machine images from there. Then we discovered OCFS2; it was easier to set up, and it allowed us to run our Xen VMs directly from there.

Being a clustered filesystem, it also allowed us to do live migration of virtual machines between different servers, which was also great.
