It's common knowledge that Linux has a fair number of file systems. Some of these are unappreciated and can be very useful outside their "comfort zone". OCFS2 is a clustered file system initially contributed by Oracle and can be a great back-end file system for general, shared storage needs.
Lately I’ve been talking about NAS servers, focusing on NFS. One of the difficult aspects of a NAS server is how to properly expand it in terms of capacity and performance. In many situations you are limited to the capacity provided by the server/storage hardware, and in almost all situations you are limited to a single NFS gateway because the underlying file system is local to the server (note: there are exceptions, but they are proprietary, and that often means more money). Wouldn’t it be nice to scale the storage as needed and share it across multiple NFS gateways (or even CIFS gateways)? One approach to this is a clustered file system.
A clustered file system is one that is mounted on multiple servers at the same time. This involves some sort of shared-disk (shared storage) system, typically a SAN or an external RAID array, that gives the servers direct access to the disks at the block level. One can easily imagine the problems that arise when multiple servers try to access the same file, or part of a file, at the same time. To prevent this, a clustered file system provides a mechanism that coordinates concurrent access to files. Note that this mechanism is absent from more conventional local file systems. Consequently, clustered file systems can be more complicated than local ones.
Access to the storage from the servers is provided by a block-level protocol. Typical ones are SCSI (often used when sharing an external RAID array), iSCSI, ATA over Ethernet, InfiniBand, and Fibre Channel (FC). Linux supports all of these protocols, although some hardware devices may not be supported by Linux. A very easy example is to use OpenFiler, which supports iSCSI, to provide storage for a set of servers running a clustered file system.
One such clustered file system in Linux is OCFS2 (Oracle Clustered File System – 2). The file system was developed from OCFS, which was focused entirely on database storage needs. OCFS2 added POSIX compliance and other features that can be very useful for shared-storage requirements outside of databases. Let’s take a quick look at OCFS2, starting with OCFS.
OCFS (Oracle Clustered File System)
Oracle developed OCFS as a shared-disk file system for use with the database files from Oracle’s clustered database. This means it had a limited, albeit very useful, focus and lacked POSIX compatibility (if that is important to you). It was released under the GNU General Public License (GPL) but was never part of the mainline kernel, although it was used by many people running Oracle on Linux.
OCFS was superseded by OCFS2, which added POSIX compatibility, broadening its appeal. One of the achievements of OCFS2, besides being POSIX compatible, was that it was merged into the 2.6.16 kernel. The experimental label was removed from OCFS2 in the 2.6.19 kernel.
The actual file system component of OCFS2 is inspired by ext3, but it deviates from ext3 in its implementation. For example, it uses extents, as ext4 does. In addition, rather than implementing its own journaling subsystem, it initially used the Linux JBD (Journaling Block Device) layer. JBD uses 32-bit block numbers, limiting OCFS2 to a file system size of 2^32 * blocksize. With a 4KB block size, this means OCFS2 was limited to 16TB.
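The arithmetic behind that limit is straightforward:

```python
# JBD uses 32-bit block numbers, so a file system can address at most
# 2**32 blocks. With OCFS2's recommended 4KB block size:
max_blocks = 2**32
block_size = 4 * 1024                 # 4KB
max_bytes = max_blocks * block_size
print(max_bytes // 2**40, "TB")       # -> 16 TB
```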
In the 2.6.28 kernel OCFS2 switched over to the JBD2 journaling layer, which fixed the 32-bit limitation of JBD. As of this writing, support for file systems greater than 16TB still isn’t fully activated, according to Sunil Mushran, an OCFS2 developer. The file system is theoretically capable of much more than 16TB (the theoretical limit is about 4PB), but according to Sunil it hasn’t had sustained testing beyond 16TB. However, he believes that support for file systems beyond 16TB will happen sometime this year.
One of the keys to OCFS2 is the lock manager. A lock manager plays traffic cop when various processes are accessing the same file or data range. In essence, the lock manager gives processes read-only or write access to various parts of files. This prevents a process from overwriting data that it should not. The lock manager in OCFS2 is distributed, since OCFS2 itself is distributed, so that if a node goes down the entire file system doesn’t go down with it. Generically this kind of lock manager is called a DLM (Distributed Lock Manager), and there are several DLMs, including the one in OCFS2.
Making a lock manager work in a distributed environment may sound easy, but it definitely is not. One of the key reasons is that it has to work correctly over the network that connects the nodes, tolerating whatever latency the network imposes. Moreover, it must continue to work correctly as nodes are added to, or removed from, the group, and it must adapt when a node that holds a lock goes down (and perhaps comes back up). It is a non-trivial problem, but it holds the key to clustered file systems.
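To make the read/write semantics concrete, here is a toy, single-process sketch of the shared/exclusive lock compatibility rules a DLM enforces. This is purely illustrative and is not OCFS2’s implementation; a real DLM coordinates these grants over the network among all nodes in the cluster.

```python
# Toy illustration of DLM-style lock compatibility: many readers may share
# a resource, but a writer needs exclusive access. Names are made up.
class ToyLockManager:
    def __init__(self):
        # resource -> (mode, set of holders)
        self.locks = {}

    def acquire(self, resource, holder, mode):
        """Grant the lock if compatible and return True, else return False."""
        current = self.locks.get(resource)
        if current is None:
            self.locks[resource] = (mode, {holder})
            return True
        cur_mode, holders = current
        # Shared locks are compatible only with other shared locks.
        if mode == "shared" and cur_mode == "shared":
            holders.add(holder)
            return True
        # The sole holder may convert its own lock.
        if holders == {holder}:
            self.locks[resource] = (mode, {holder})
            return True
        return False  # conflicting request must wait

    def release(self, resource, holder):
        mode, holders = self.locks[resource]
        holders.discard(holder)
        if not holders:
            del self.locks[resource]

lm = ToyLockManager()
assert lm.acquire("inode:42", "node1", "shared")         # reader granted
assert lm.acquire("inode:42", "node2", "shared")         # concurrent readers OK
assert not lm.acquire("inode:42", "node3", "exclusive")  # writer must wait
lm.release("inode:42", "node1")
lm.release("inode:42", "node2")
assert lm.acquire("inode:42", "node3", "exclusive")      # now the writer proceeds
```

The hard part a real DLM solves, and this toy does not, is making those grant decisions consistent across machines despite network latency and node failures.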
OCFS2 has a wide range of additional features, some of which are listed here:
- OCFS2 is endian neutral, allowing little-endian (e.g. x86 and x86_64) and big-endian (e.g. PowerPC) architectures to coexist in the same cluster, as well as 32-bit and 64-bit mounts
- It supports direct IO, asynchronous IO, buffered IO, splice IO (added in version 1.4), and memory mapped IO
- It supports large inodes, which are good for small files because the file data can be stored in the inode itself (better performance)
- It tracks modification time (mtime), access time (atime), and attribute change time (ctime). It can also use relative atime (relatime).
- It supports ordered and writeback journal modes
- It supports Unix-style permissions as well as ACLs
- In the 2.6.29 kernel it added metadata checksums (reducing the possibility of metadata corruption)
- It has variable block sizes (512 Bytes, 1KB, 2KB, and 4KB) with 4KB being the recommended size
- It has a variable cluster size where the cluster size is the unit of space allocated for file data. The options in version 1.4 of OCFS2 are 4KB, 8KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB, and 1MB. The recommended size is 4KB.
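The block and cluster sizes are chosen when the volume is formatted. As a sketch, a minimal mkfs.ocfs2 invocation might look like the following; it requires root and a shared block device, and the device path and label here are hypothetical:

```shell
# Format a shared LUN with the recommended 4KB block and cluster sizes,
# slots for four nodes, and a volume label.
# /dev/sdb1 and "shared_vol" are examples -- substitute your own.
mkfs.ocfs2 -b 4K -C 4K -N 4 -L shared_vol /dev/sdb1
```

The -N (node slots) value caps how many nodes can mount the volume concurrently, so size it for the cluster you expect to grow into.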
There is also a set of user tools for OCFS2, split across two packages: ocfs2-tools, which provides the command-line utilities, and ocfs2console, which provides a GUI. The documentation suggests using the GUI to configure a clustered file system, but many of the HOWTOs do it by hand (the configuration file is quite easy to read).
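For reference, a minimal /etc/ocfs2/cluster.conf for a two-node cluster might look like the following (the node names and addresses are made up; the file must be identical on every node):

```
cluster:
        node_count = 2
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.101
        number = 0
        name = node1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.102
        number = 1
        name = node2
        cluster = ocfs2
```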
While OCFS2 is a clustered file system, its volumes can be mounted as clustered volumes or as local (single-node) volumes. If you need more performance, you may need to add more nodes to the OCFS2 cluster. Adding a node to an existing system is fairly easy. According to Sunil, “Adding nodes to a cluster is a straightforward process. You don’t have to unmount to add. Just leave it mounted on the node(s). If the new node is already registered, then just mount the volume on the new node. Mounting will kick-start the heartbeat, which will trigger the other nodes to connect to the new node. Once connected, the new node can then join the DLM domain, etc. All this handled transparently by the mount command.” The OCFS2 guide has instructions on how to do this, including for nodes that are not registered.
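As a rough sketch of what registering and bringing up a new node involves, the sequence might look like this (device, node name, and addresses are hypothetical; o2cb_ctl and the o2cb service come with ocfs2-tools, and the OCFS2 guide is the authority on the exact steps):

```shell
# Register the new node with the cluster configuration,
# bring the cluster stack online, then mount on the new node.
# node3, the address, and /dev/sdb1 are examples only.
o2cb_ctl -C -n node3 -t node -a number=2 -a ip_address=192.168.1.103 \
         -a ip_port=7777 -a cluster=ocfs2
service o2cb online ocfs2
mount -t ocfs2 /dev/sdb1 /mnt/shared
```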
When/Where You Can Use OCFS2 Outside of Databases
The title of this article is “OCFS2 – An Unappreciated Linux File System” because, while OCFS2 is arguably the best file system for Oracle, its POSIX compatibility and flexibility make it useful well beyond databases.
The most obvious use case is to create a clustered file system between two nodes. You simply take two nodes that share some storage (perhaps FC, iSCSI, or SCSI) and a common TCP network, and you configure OCFS2 across the two nodes. You can then share data between the two nodes via OCFS2, but only between those nodes. You can grow the cluster and add more nodes, but you are still limited to sharing data among the OCFS2 nodes themselves. OCFS2 also has practical scaling limits that depend on your performance requirements, the bandwidth and latency of the network connecting the nodes, the bandwidth and latency of the storage subsystem, and the CPU and memory of each node. So the number of nodes is practically driven by the kind of performance you want from the file system/storage combination.
To share data with more systems, you can use the OCFS2 nodes as NFS gateways for other clients. In this case each node of the OCFS2 file system should be fairly beefy (i.e. more than two cores), because it functions not only as an IO node (i.e. part of the OCFS2 file system) but also as an NFS gateway. If you have more than one node in the OCFS2 cluster, you can use each one as an NFS gateway, dividing the clients equally among the nodes. This reduces the load on any particular gateway but imposes additional work on the administrator, who typically has to assign clients to gateways by hand. You can also use DNS round-robin to reduce this administrative burden, at the cost of some additional work in the initial DNS configuration.
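The round-robin approach amounts to giving every gateway’s address the same DNS name, so that successive client lookups rotate through the gateways. A BIND zone fragment might look like this (the name and addresses are made up):

```
; Round-robin: successive lookups of "nfs" rotate through the gateways.
nfs     IN  A   192.168.1.101
nfs     IN  A   192.168.1.102
nfs     IN  A   192.168.1.103
```

Clients then all mount from the single name (e.g. nfs.example.com) and land on different gateways without per-client configuration.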
An advantage of using OCFS2 as the back-end file system for NFS is that if you need more capacity, you just add more storage. If you need more overall performance, you can add another node to the OCFS2 cluster, which can then be an NFS gateway. While this doesn’t improve individual client performance, it increases the overall aggregate performance.
OCFS2 is a very well-known file system in the world of databases, particularly for Oracle, but it is often underappreciated in the general Linux storage world, perhaps because people don’t realize it is POSIX compliant. It has a number of great features as a single-node file system, such as extents and metadata checksums, but perhaps more importantly it is a clustered file system. This means that some sort of shared storage must be used, such as a SAN or an external RAID array.
For general Linux storage functions you can use OCFS2 as the back-end storage for NFS. This allows you to easily grow capacity by adding storage to the SAN. Each OCFS2 node can act as an NFS server (gateway) with its own set of clients. This lets you load-balance NFS traffic among the OCFS2 nodes, although that is really something of a manual process (i.e. you have to divide the clients among the OCFS2 nodes yourself). And just in case: you should be able to install Samba and use OCFS2 as back-end storage for CIFS traffic (I haven’t tried this, but I haven’t seen any roadblocks). If you need commercial support, Oracle provides that as well.
Give OCFS2 a chance. Take a look, try it on some test machines, and see how it satisfies your requirements. I think you will be surprised.
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).