How Old is that Data on the Hard Drive?

The vast amount of data being stored in this day and age naturally leads to files sitting unused for longer and longer periods of time. A new app, agedu, can quickly tell you what data on your filesystem is lying fallow.

There is a system administrator theorem that basically reads, “Users will find a way to use space faster than new space can be added to systems.” The corollary to this theorem is, “Users will always insist that all of their data is critical and must be retained on-line.” The end result is that the explosion of storage is a burden every system administrator must bear. What administrators need is a tool, or a set of tools, to help them examine file systems and determine which files and directories are being used and which haven’t been used recently (i.e., the “age” of files and directories).

In this article I want to introduce a new application called agedu that can be used to get a snapshot of the age of files and directories. From this information you can get a general sense of which directories hold older data that hasn’t been accessed (or modified) in a while. It can also be used in scripts to create reports about systems, or simply to understand what’s going on with your storage, even on your desktop or laptop.

Recent Data Age Study

A fairly recent study from the University of California, Santa Cruz and NetApp examined CIFS storage within NetApp itself. Part of the storage was deployed in the corporate data center, where the hosts were used by over 1,000 marketing, sales, and finance employees. The second part was a high-end file server deployed in the engineering data center and used by over 500 engineering employees.

Robin Harris at Storage Mojo writes a very good blog about storage. Robin has written about this study and come up with some very interesting observations. From his blog:

Some significant differences from prior studies:


  • Workloads more write oriented. Read/write byte ratios are now only 2:1 compared to the 4:1 or higher ratios reported earlier.
  • Workloads less read-centric. Read/write workloads are now 30x more common.
  • Most bytes transferred sequentially. These runs are 10x the length found in the old studies.
  • Files 10x bigger.
  • Files live 10x longer. Less than half are deleted within a day of creation.

Cool new findings:


  • Files rarely re-opened. Over 66 percent are re-opened once and 95 percent fewer than 5 times.

  • Over 60 percent of file re-opens are within a minute of the first open.

  • Less than 1 percent of clients account for 50 percent of requests.

  • Infrequent file sharing. Over 76 percent of files are opened by just 1 client.

  • Concurrent file sharing is very rare. As the prior point suggests, only 5 percent of files are opened by multiple clients, and 90 percent of those are read only.

  • Most file types have no common access pattern.

And there’s this observation from his blog:

Over 90 percent of the active storage was untouched during the study. That makes it official: data is getting cooler.

While one cannot draw sweeping conclusions from this data, it is a very interesting data point. Two observations in particular are pertinent:


  1. Files are rarely reopened.

  2. Over 90 percent of the active storage was untouched during the study.

This raises the question of whether the data is needed at all, since it was created and never touched again. Wouldn’t it be more appropriate to archive the data? And while the results are enlightening, how could you scan your own systems in a similar fashion?

Data Comes in All Ages

As background on the “age” of data on *nix systems, it’s good to remind yourself that there are really three ages when talking about files:


  • File or Directory’s Change Time (ctime)
  • File or Directory’s Access Time (atime)
  • File or Directory’s Modify Time (mtime)

The first time, ctime, is when changes were last made to the file or directory’s inode. This can include changes to the actual data, to permissions, to ownership, and so on. You can view the ctime with the command ls -lc. The second time, atime, is when the file was last accessed. You can view access times with the command ls -lu. The third time, mtime, is when the actual file contents were last changed. You can view the modify time with the command ls -l. To get all of this information in a quasi-readable format on Linux, you can use the stat command.
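The same three timestamps can also be read from a script. What follows is a minimal Python sketch, not part of agedu; the helper name file_ages and the scratch-file demo are assumptions made purely for illustration. It relies only on the standard library's os.stat, which exposes the fields st_ctime, st_atime, and st_mtime.

```python
import os
import tempfile
import time


def file_ages(path, now=None):
    """Return the age, in seconds, of a path's ctime, atime, and mtime."""
    st = os.stat(path)
    if now is None:
        now = time.time()
    return {
        "ctime": now - st.st_ctime,  # inode change (on Windows: creation time)
        "atime": now - st.st_atime,  # last access
        "mtime": now - st.st_mtime,  # last content modification
    }


if __name__ == "__main__":
    # Demo on a scratch file: push atime and mtime ten days into the past.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        scratch = handle.name
    try:
        ten_days_ago = time.time() - 10 * 86400
        os.utime(scratch, (ten_days_ago, ten_days_ago))
        for name, seconds in file_ages(scratch).items():
            print(f"{name}: {seconds / 86400:.1f} days old")
    finally:
        os.unlink(scratch)
```

On a *nix system the demo should report atime and mtime as roughly ten days old while ctime stays close to zero, because resetting the timestamps is itself a change to the inode.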

This means that when the age of data is discussed on *nix systems, you need to be very specific about which age you are using. One of the most important times to consider is the access time, which tells you the last time the data was used. This information is very useful when determining whether data should be removed and/or archived. However, tracking the “change” time and the “modify” time is important as well. So, how does one get all of this information?
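As a rough illustration of the kind of scan involved, here is a minimal Python sketch that walks a directory tree and lists files whose access time is older than a threshold. It is not how agedu works; the function name stale_files, the 30-day threshold, and the throwaway demo tree are all assumptions for the example. Keep in mind that access times are only as good as the filesystem's bookkeeping: on Linux, mounting with noatime or relatime means atime is not updated on every read.

```python
import os
import tempfile
import time


def stale_files(root, min_age_days, now=None):
    """Return (path, age_in_days) for files under root whose atime is at
    least min_age_days old, oldest first."""
    if now is None:
        now = time.time()
    cutoff = min_age_days * 86400
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            age = now - st.st_atime
            if age >= cutoff:
                results.append((path, age / 86400))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


if __name__ == "__main__":
    # Demo on a throwaway tree: one file is backdated 100 days, one is fresh.
    with tempfile.TemporaryDirectory() as root:
        for name in ("old.dat", "new.dat"):
            with open(os.path.join(root, name), "w") as handle:
                handle.write("x")
        backdated = time.time() - 100 * 86400
        os.utime(os.path.join(root, "old.dat"), (backdated, backdated))
        for path, age_days in stale_files(root, min_age_days=30):
            print(f"{age_days:7.1f} days  {os.path.basename(path)}")
```

A scan like this touches every file's metadata on each run, which gets slow on large filesystems; that cost is one reason to reach for a dedicated tool.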

Next: Agedu – A Tool for Displaying Data Usage »

Comments on "How Old is that Data on the Hard Drive?"

freephile

Thanks for the article.

Although routine, many users will be thrown off by the incomplete install instructions. The full procedure is:

(system-wide)
./configure
make
sudo make install

or if installing locally into your user account:
mkdir --parents $HOME/bin/agedu
./configure --prefix=$HOME/bin/agedu
make
make install

Reply
webtenet

Just thought I should report a possible typo.

“Aegdu – A Tool for Displaying Data Usage”

Should it be “Agedu – A Tool for Displaying Data Usage”, i.e. Agedu instead of Aegdu?

Thanks.

Reply
webtenet

In fact there are several places where it says “Aegdu” instead of “Agedu”. Thanks.

Reply
mortenb

The -w option should use a local web-server.

I do not like software that phones home.

Reply
melbogia

“..However, also tracking the “create” time and the “modify” time are very important as well. So, how does one get all of this information?”

There’s a typo here, pretty sure you meant “change” instead of “create”

Reply
laytonjb

freephile,

Sorry about not including the details in the article. When I have done this in the past, I get dinged on the article for being too detailed and not focusing on other things (the packages explain how to build and install). So I thought I would try something a little simpler. Guess I went too far the other way. Thanks for the feedback.

Jeff

Reply
laytonjb

Yep – sorry for the typos. It’s such a strange name that I kept mis-spelling it.

Jeff

Reply
laytonjb

You are correct. For some reason when I write I tend to say “create” rather than “change”. Must be some long lost muscle memory that drives me to “create” rather than “change” :)

Thanks for the feedback!

Jeff

Reply
bhw

I agree with mortenb, using the -w option turned me off to using ‘agedu’. Sending my data to an external server to generate the page is a security risk, and also didn’t get past our security wall (Thank you Sec. for saving my butt on that one). Let’s face it, the people who would use this app would also know how to set up their own web server and generate the page locally. Adding the capability to output to a server hostname/IP address of your choice when using the -w option would be good. Obviously then the authors of the app would have to include the additional backend code to generate the page on a local server instead of one of their choosing. I also feel the writer of this article should have mentioned this in the review.

Reply
laytonjb

Sorry about not catching the -w option. I don’t really check security of applications as part of the article (or any articles). Doing such is out of my area. But I’m glad other people caught it.

Did anyone provide feedback to the author and ask for a new option? Or did anyone provide diffs on the code to accomplish this?

Jeff

Reply
clusterman

Check the man page at http://www.chiark.greenend.org.uk/~sgtatham/agedu/manpage.html to see options that can be used with “-w”. I really appreciate the article. I just ran it on a 2TB scratch filesystem which users never clean up. I did try the --html flag mentioned in the man page but it seg faulted. It would be nice to publish daily reports and store them, but without the html option I’m not savvy enough to figure out how to do this.

Any pointers are always appreciated!

Thanks again!

Reply
davidgro

Umm, guys – 127.164.152.163 is no more “an external server” than 127.0.0.1 is – this Is the program running as a local webserver. (And the page clusterman linked clearly states as much.)

Reply
