Extended File Attributes Rock!

Worldwide, data is growing at a tremendous rate. However, one recent study has pointed out that the size of files is not necessarily growing at the same rate, meaning that the number of files is growing rapidly. How do we manage all of this data and all of these files? While the answer to that question is complex, one place we can start is with Extended File Attributes.

Introduction

I think it’s a given that the amount of data is increasing at a fairly fast rate. We now have lots of multimedia on our desktops, and lots of files on our servers at work, and we’re starting to put lots of data into the cloud (e.g. Facebook). One question that affects storage design and performance is whether these files are large or small, and how many of them there are.

At this year’s FAST (USENIX Conference on File and Storage Technologies) the best paper went to “A Study of Practical Deduplication” by William Bolosky from Microsoft Research and Dutch Meyer from the University of British Columbia. While the paper didn’t really cover Linux (it covered Windows), focused on desktops, and was primarily about deduplication, it did present some very enlightening insights on file systems from 2000 to 2010. Some of the highlights from the paper are:


  1. The median file size isn’t changing
  2. The average file size is larger
  3. The average file system capacity has tripled from 2000 to 2010

To fully understand the difference between the first point and the second point you need to remember some basic statistics. The average file size is computed by summing the size of every file and dividing by the number of files. The median file size is found by ordering the files from smallest to largest; the median is the size of the file in the middle of the ordered list. So, with these working definitions, the three observations previously mentioned indicate that perhaps desktops have a few really large files that drive up the average file size, while at the same time there are enough small files to keep the median file size about the same despite the increase in the number of files and the increase in large files.
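
To make the difference concrete, here is a small Python sketch. The file sizes are invented purely for illustration (they are not taken from the paper); the point is only that a couple of very large files pull the average up sharply while the median does not move.

```python
from statistics import mean, median

# Hypothetical file sizes in KB, invented for illustration only.
sizes_2000 = [2, 4, 4, 8, 16, 32, 64]
# Ten years later: more small files of the same kind, plus two huge ones.
sizes_2010 = sizes_2000 + [2, 4, 4, 8, 700_000, 4_000_000]

for label, sizes in (("2000", sizes_2000), ("2010", sizes_2010)):
    print(f"{label}: files={len(sizes)} "
          f"mean={mean(sizes):.1f} KB median={median(sizes)} KB")
```

In this made-up data set the median is 8 KB in both years, while the mean grows by several orders of magnitude.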

Taken together, these observations mean that we have many more files on our desktops: we are adding some really large files along with a large number of small files.

Yes, it’s Windows. Yes, it’s desktops. But these observations are another good data point that tells us something about our data. That is, the number of files is getting larger while we are adding some very large files and a large number of small files. What does this mean for us? One thing that it means to me is that we need to pay much more attention to managing our data.

Data Management – Who’s on First?

One of the keys to data management is being able to monitor the state of your data which usually means monitoring the metadata. Fortunately, POSIX gives us some standard metadata for our files such as the following:


  • File ownership (User ID and Group ID)
  • File permissions (world, group, user)
  • File times (atime, ctime, mtime)
  • File size
  • File name
  • File type (regular file or directory)

There are several others (e.g. links) which I didn’t mention here.

With this information we can monitor the basic state of our data. We can compute how quickly our data is changing (how many files have been modified, created, deleted in a certain period of time). We can also determine how our data is “aging” – that is how old is the average file, the median file, and we can do this for the entire file system tree or certain parts of it. In essence we can get a good statistical overview of the “state of our data”.
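
As a rough sketch of the kind of “aging” report described above, the following Python walks a directory tree and computes the mean and median file age from the POSIX mtime. The function name `age_stats` and the choice of mtime (rather than atime or ctime) are assumptions made for this example.

```python
import os
import stat
import time
from statistics import mean, median

def age_stats(root, now=None):
    """Return (file_count, mean_age_days, median_age_days) using mtime."""
    now = time.time() if now is None else now
    ages_days = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):  # count regular files only
                ages_days.append((now - st.st_mtime) / 86400.0)
    if not ages_days:
        return 0, 0.0, 0.0
    return len(ages_days), mean(ages_days), median(ages_days)
```

Calling `age_stats(".")` reports on the current directory tree; running it on different subtrees gives the per-area view mentioned above.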

All of this capability is just great and goes far beyond anything that is available today. However, with the file system capacity increasing so rapidly and the median file size staying about the same, we have a lot more files to monitor. Plus we keep data around for longer than we ever have. Over time it is easy to forget what a file contains or what a cryptic file name means. Since POSIX is good enough to give some basic metadata, wouldn’t it be nice to have the ability to add our own metadata? Something that we control that would allow us to add information about the data?

Extended File Attributes

What many people don’t realize is that there actually is a mechanism for adding your own metadata to files that is supported by most Linux file systems. This is called Extended File Attributes. In Linux, many file systems support it such as the following: ext2, ext3, ext4, jfs, xfs, reiserfs, btrfs, ocfs2 (2.1 and greater), and squashfs (kernel 2.6.35 and greater or a backport to an older kernel). Some of the file systems have restrictions on extended file attributes, such as the amount of data that can be added, but they do allow for the addition of user controlled metadata.

Any regular file on one of the previously mentioned file systems may have a list of extended file attributes. Each attribute has a name and some associated data (the actual attribute value). The name starts with what is called a namespace identifier (more on that later), followed by a dot “.”, and then followed by a null-terminated string. You can add as many names separated by dots as you like to create “classes” of attributes.

Currently on Linux there are four namespaces for extended file attributes:


  1. user
  2. trusted
  3. security
  4. system

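This article will focus on the “user” namespace since it has no restrictions with regard to naming or contents. For metadata controlled by root, the “trusted” namespace is the one to use: its attributes are visible and accessible only to processes with the CAP_SYS_ADMIN capability, which in practice means root.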

The system namespace is used primarily by the kernel for access control lists (ACLs). For example, it uses names such as “system.posix_acl_access” and “system.posix_acl_default” for extended file attributes. Access to each system attribute is governed by the kernel’s policy for that particular attribute, so the general wisdom is that you should not use the system namespace for your own metadata. Metadata that should be controlled by root, or that should be immutable with respect to the users, belongs in the trusted namespace instead.

The security namespace is used by SELinux. An example of a name in this namespace would be something such as “security.selinux”.
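
The naming scheme (namespace identifier, a dot, then the rest of the name) is simple enough to check in a few lines. The helper below is a hypothetical illustration written for this article, not part of any library; it only splits a name and checks the prefix against the four namespaces listed above.

```python
KNOWN_NAMESPACES = ("user", "trusted", "security", "system")

def split_attr_name(full_name):
    """Split 'user.checksum.md5' into ('user', 'checksum.md5')."""
    namespace, sep, name = full_name.partition(".")
    if not sep or not name:
        raise ValueError(f"expected 'namespace.name', got {full_name!r}")
    if namespace not in KNOWN_NAMESPACES:
        raise ValueError(f"unknown namespace {namespace!r}")
    return namespace, name

print(split_attr_name("user.checksum.md5"))
print(split_attr_name("security.selinux"))
```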

The user attributes are meant to be used by the user and any application run by the user. The user namespace attributes are protected by the normal Unix user permission settings on the file. If you have write permission on the file then you can set an extended attribute. To give you an idea of what you can do for “names” for the extended file attributes for this namespace, here are some examples:


  • user.checksum.md5
  • user.checksum.sha1
  • user.checksum.sha256
  • user.original_author
  • user.application
  • user.project
  • user.comment

The first three example names are used for storing checksums of the file using three different checksum methods. The fourth example lists the originating author, which can be useful in case multiple people have write access to the file or the original author leaves and the file is assigned to another user. The fifth example can list the application that was used to generate the data, such as output from an application. The sixth example lists the project with which the data is associated. And the seventh example is the all-purpose general comment. From these few examples, you can see that you can create some very useful metadata.
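
As a sketch of how the checksum names might be used, the following Python hashes a file and tries to record the digest in “user.checksum.sha256”. It relies on `os.setxattr`, which Python provides on Linux only; the function name `store_sha256` and the choice to store the digest as an ASCII hex string are assumptions made for this example.

```python
import hashlib
import os

def store_sha256(path):
    """Hash the file and try to record the digest as an extended attribute.

    Returns (hex_digest, stored). stored is False when the attribute could
    not be written (no xattr support, no permission, non-Linux platform).
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    hex_digest = digest.hexdigest()
    try:
        os.setxattr(path, "user.checksum.sha256", hex_digest.encode("ascii"))
        stored = True
    except (OSError, AttributeError):
        stored = False
    return hex_digest, stored
```

Storing the digest next to the data means a later integrity check only has to re-hash the file and compare against the attribute.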

Tools for Extended File Attributes

There are several very useful tools for manipulating (setting, getting) extended attributes. These are usually included in the attr package that comes with most distributions. So be sure that this package is installed on the system.

The second thing you should check is that the kernel has attribute support. This should be turned on for almost every distribution that you might use, although there may be some very specialized ones that might not have it turned on. But if you build your own kernels (as yours truly does), be sure it is turned on. You can just grep the kernel’s “.config” file for any “ATTR” attributes.

The third thing is to make sure that the libattr package is installed. If you installed the attr package then this package should have been installed as well. But I like to be thorough and check that it was installed.

Then finally, you need to make sure the file system you are going to use with extended attributes is mounted with the user_xattr option.
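
If you are not sure whether a given file system accepts user attributes, one pragmatic check is simply to try setting one on a scratch file. This is a minimal sketch, assuming Linux and Python’s `os.setxattr`; the attribute name “user.xattr_probe” and the function name are made up for the test.

```python
import os
import tempfile

def supports_user_xattr(directory):
    """Return True if a user.* attribute can be set on a file in directory."""
    if not hasattr(os, "setxattr"):
        return False  # platform without the Linux xattr calls
    # The scratch file is deleted automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as probe:
        try:
            os.setxattr(probe.name, "user.xattr_probe", b"1")
        except OSError:
            return False
        return True

print(supports_user_xattr(tempfile.gettempdir()))
```

Pass the directory you actually plan to use, since different mount points can behave differently.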

Assuming that you have satisfied all of these criteria (they aren’t too hard), you can now use extended attributes! Let’s do some testing to show the tools and what we can do with them. Let’s begin by creating a simple file that has some dummy data in it.

$ echo "The quick brown fox" > ./test.txt
$ more test.txt
The quick brown fox

Now let’s add some extended attributes to this file.

$ setfattr -n user.comment -v "this is a comment" test.txt


This command sets an extended file attribute on the file. The “-n” option gives the name of the attribute, here “user.comment”, and the “-v” option gives its value. The final argument to the command is the name of the file.

You can determine the extended attributes on a file with a simple command, getfattr as in the following example,

$ getfattr test.txt
# file: test.txt
user.comment


Notice that this only lists which extended attributes are defined for a particular file, not the values of the attributes. Also notice that it only listed the “user” attributes. By default getfattr only matches attribute names in the user namespace; to see attributes in other namespaces you need to pass a pattern with the “-m” option (for example, “-m -” matches all names), and you will still only see the attributes that you have permission to read.

To see the values of the attributes you have to use the following command:

$ getfattr -n user.comment test.txt
# file: test.txt
user.comment="this is a comment"


With the “-n” option it will list the value of the extended attribute name that you specify.

If you want to remove an extended attribute you use the setfattr command but use the “-x” option such as the following:

$ setfattr -x user.comment test.txt
$ getfattr -n user.comment test.txt
test.txt: user.comment: No such attribute


You can tell that the extended attribute no longer exists because of the error returned by the getfattr command.
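
The same set, list, get, and remove steps can also be done from a script. The sketch below uses the Linux-only functions in Python’s `os` module; the wrapper function `xattr_roundtrip` is just an illustration, and it returns None when the file system refuses user attributes.

```python
import os
import tempfile

def xattr_roundtrip(path):
    """Set, list, read, and remove user.comment on path.

    Returns the value read back, or None if user attributes
    are not available for this file.
    """
    name = "user.comment"
    try:
        os.setxattr(path, name, b"this is a comment")  # setfattr -n ... -v ...
    except (OSError, AttributeError):
        return None
    names = os.listxattr(path)                         # getfattr FILE
    value = os.getxattr(path, name).decode("utf-8")    # getfattr -n ... FILE
    os.removexattr(path, name)                         # setfattr -x ... FILE
    return value if name in names else None

with tempfile.NamedTemporaryFile() as tmp:
    print(xattr_roundtrip(tmp.name))
```

Note that attribute values are raw bytes at this level, so the script decides how to encode and decode them.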

Summary

Without belaboring the point, the amount of data is growing at a very rapid rate even on our desktops. A recent study also pointed out that the number of files is also growing rapidly and that we are adding some very large files but also a large number of small files so that the average file size is growing while the median file size is pretty much staying the same. All of this data will result in a huge data management nightmare that we need to be ready to address.

One way to help address the deluge of data is to enable a rich set of metadata that we can use in our data management plan (whatever that is). An easy way to do this is to use extended file attributes. Most of the popular Linux file systems allow you to add metadata to files, and in the case of xfs, you can pretty much add as much metadata as you want to the file.

There are four “namespaces” of extended file attributes that we can access. The one we are interested in as users is the user namespace because if you have normal write permissions on the file, you can add attributes. If you have read permission on the file you can also read the attributes. As administrators we could use the trusted namespace (just be careful) for attributes that we want to assign as root (i.e. users can’t change or query the attributes).

The tools to set and get extended file attributes come with virtually every Linux distribution. You just need to be sure they are installed with your distribution. Then you can set, retrieve, or erase as many extended file attributes as you wish.

Extended file attributes can be used to great effect to add metadata to files. It is really up to the user to do this since they understand the data and have the ability to add/change attributes. Extended attributes give a huge amount of flexibility to the user and creating simple scripts to query or search the metadata is fairly easy (an exercise left to the user). We can even create extended attributes as root so that the user can’t change or see them. This allows administrators to add really meaningful attributes for monitoring the state of the data on the file system. Extended file attributes rock!
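
As one possible starting point for such a script, here is a minimal Python sketch that walks a directory tree and reports the files carrying a given attribute, optionally filtering on its value. It assumes Linux (for `os.getxattr`); the function name `find_by_xattr` is made up for this example.

```python
import os

def find_by_xattr(root, name, wanted=None):
    """Yield (path, value) for files under root that carry attribute name.

    If wanted (bytes) is given, only files whose value equals it are yielded.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                value = os.getxattr(path, name)
            except OSError:
                continue  # attribute missing, unreadable, or unsupported
            if wanted is None or value == wanted:
                yield path, value
```

For example, `list(find_by_xattr(".", "user.project"))` returns every file below the current directory that has a “user.project” attribute, together with its raw value.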

Comments on "Extended File Attributes Rock!"

fsl

Very nice, but given that the majority of files come from desktop users running Windows, how can I put those extended attributes to practical use? Does Samba support them and present them in a way a Windows user can query / set from Windows Explorer? Is there any integration with Microsoft Office or Open/Libre Office document properties? Does any document management system support them? And what about headers / tags embedded in multimedia files?

Reply
mukiwa

Is it possible to read these attributes from C code?

Reply
lakshmipathi

>Is it possible to read these attributes from C code?
Are you looking for getxattr() ? Check man getxattr for more

Reply
evanlec

This is a very nice bit of info about Linux filesystems. Directories and filenames are just simply not enough these days to properly organize files in a way that makes sense to humans!
I am curious to see what tools/scripts/etc have been written to try and take advantage of the user namespace attributes in order to better organize a users home (~) directory. I think I shall start taking advantage of this immediately for myself!

Reply
perl2ruby

It would be useful to have options for find command that support check for extended attributes.

Reply

yes, user_xattr extended attributes are nice in theory, but as long as systems like kde4 nepomuk avoid extended attributes and use a centralized, data-centric path for metadata manipulation, console and manual attribute manipulation ( using the getfattr and setfattr tools and so on ) will remain a marginal, peripheral and exotic computer geek exercise …

we have a lot of mp3 players with mp3 tag editing support but i do not find one with extended attributes manipulation capabilities in parallel with oldfashioned mp3 id tags embedded directly in mp3 file!

right-click in my kde4 dolphin file manager -> properties -> advanced properties for any file or directory … where is xattr tags? nowhere! when i make custom tags for files in kde4 using nepomuk and dolphin this tags is saved and dumped in an mysql centralized database not in file extended attributes who can be read and reread and set with getfattr and setfattr … so, xattr manipulation is still hidden deep on my fs even if all of my partitions has user_xattr mount option enabled years ago waiting and waiting for things like kde4 nepomuk infrastructure … but right now nepomuk and all kde4 social desktop infrastructure is completely decoupled from it!

nepomuk choice to use centralized database for extended tag storage and manipulations is a non-optimal solution … if i lose my nepomuk database file ( file corruption is not a stranger man even in linux world ) i lose all my tags! … if i want to move files to other os or distro with other extended tag management solution ( why not an win7 or osx !? ) i need a database dump and a database import and no guarantee of inter-compatibility! what will be more compatible is a double layer approach in nepomuk, a centralized database management ( for performance reasons ) and for every nepomuk tag which is file related in the same time an extended attribute for that file created and saved in an automatic routine ( things like custom tags, for web pages original url, file type, photos and mp3 tags and so on ) when i extract from files this tags for nepomuk database … all this extended attributes can be saved with files, moved with files ( i think extended attributes is compatible even with ntfs fs from windows world! ) and later on rescans tags databases recreated in a reliable way directly from fs storage!

all precious manual tags will be with us, saved in extended attributes, with no need to tag again and again with every os reinstall all files …

Reply

    Believe it nepomuk has many other little defects besides this, for example, nepomuk needs metadata parsers, you can have nepomuk but if you don’t feed it any metadata parser or file parser, he won’t do anything with the files you give him.
    Obviously, there is also the detail that nepomuk will slow or even crash, if you feed him tons of data and believe it I have KDE 4 installed (in fact I’m writing it from it), but nepomuk is disabled since he can’t handle all of the files I have, like the photographs, the audio files, the code from my projects and the videos from the many series I like (it is an total of 16GB spread over >400K of files).
    Also, people want something that can organize files physically in the correct order, not only index them and nepomuk fails drastically on this.

    Reply

It would be most useful if you could rather add labels as per GMail instead of using file names and directories.

This allows you to add one or more labels to categorise your data. For example: Accounts\ Invoice\ Outstanding\ could be three separate labels. Which could be reclassified as Accounts\ Invoice\ Paid\.

Furthermore, dynamic filters can then be created to automatically classify data on the fly.

Reply

Is there any way we can read and write system attributes in a C program? getxattr seems to work for the user namespace, but I am looking for the system one.
Thanks

Reply

Hi.
Indeed, it rocks. I imagine building metadata on file libraries, without db, by adding, for example JSON data in user.comment field.

Is it possible to use find command to search for anything stored in these attributes ?

thks

Reply

Hi folks,

Nice article, but, as many others on this topic, it shows that up to the moment most x-attrs solutions are far from being useful to the user.

There’s another article (I’m not the author) which shows a number of practical aspects of the extended attributes, their compatibility, caveats, etc.

Extended attributes: the good, the not so good, the bad: http://lesbonscomptes.com/pages/extattrs.html. Courtesy the author.

I’m specifically interested in adopting or even implementing a cross-platform (!) extended attributes toolset for the problem I’m mostly concerned about for the last few years: intelligent archiving/cleanup based on custom-set content value/expiry/etc.

Every time I have to bring order to a pile of last year’s projects, I feel pain and start searching for viable solutions… No luck so far!

If you’re as concerned about “intelligent cleanup” problem as I am, perhaps we could join efforts in an open-source initiative? Skype: dadooda.

Reply
