The vast amount of data being stored these days naturally leads to files sitting unused for longer and longer periods of time. A new app, agedu, can quickly tell you what data on your filesystem is lying fallow.
There is a system administrator theorem that is basically written as “Users will find a way to use space faster than new space can be added to systems.” The corollary to this theorem is, “Users will always insist that all of their data is critical and must be retained on-line.” The end result of this theorem and corollary is that the explosion of storage is really a burden that all system administrators must bear. What administrators need is a tool or a set of tools to help them examine file systems to determine what files and/or directories are being used and which ones haven’t been used recently (i.e. the “age” of files and directories).
In this article I want to introduce a new application called agedu that can be used to get a snapshot of the age of files and directories. From this information you can get a general sense of what directories have older data that hasn’t been accessed (or modified) in a while. It can also be used in scripts to create reports about systems or for simply understanding what’s going on with your storage, even on your desktop or laptop.
Recent Data Age Study
There was a fairly recent study from the University of California, Santa Cruz and NetApp that examined CIFS storage within NetApp itself. Part of the storage was deployed in the corporate data center, where the hosts were used by over 1,000 marketing, sales, and finance employees. The second part of the storage was a high-end file server deployed in the engineering data center and used by over 500 engineering employees.
Robin Harris at Storage Mojo writes a very good blog about storage. Robin has written about this study and come up with some very interesting observations. From his blog:
Some significant differences from prior studies:
- Workloads more write oriented. Read/write byte ratios are now only 2:1 compared to the 4:1 or higher ratios reported earlier.
- Workloads less read-centric. Read/write workloads are now 30x more common.
- Most bytes transferred sequentially. These runs are 10x the length found in the old studies.
- Files 10x bigger.
- Files live 10x longer. Less than half are deleted within a day of creation.
Cool new findings:
- Files rarely re-opened. Over 66 percent are re-opened once and 95 percent fewer than 5 times.
- Over 60 percent of file re-opens are within a minute of the first open.
- Less than 1 percent of clients account for 50 percent of requests.
- Infrequent file sharing. Over 76 percent of files are opened by just 1 client.
- Concurrent file sharing is very rare. As the prior point suggests, only 5 percent of files are opened by multiple clients, and 90 percent of those are read only.
- Most file types have no common access pattern.
And there’s this observation from his blog:
Over 90 percent of the active storage was untouched during the study. That makes it official: data is getting cooler.
While one cannot take this data and make sweeping conclusions, it is a very interesting data point. In particular, there are two important observations that are very pertinent:
- Files are rarely reopened.
- Over 90 percent of the active storage was untouched during the study.
This raises the question of whether the data is needed at all, since it was created and never touched again. Wouldn’t it be more appropriate to archive the data? While the results are very enlightening, how could you scan your own systems in a similar fashion?
Data Comes in All Ages
As a primer on the “age” of data on *nix systems, it’s good to remind yourself that there are really three ages when talking about files:
- File or Directory’s Change Time (ctime)
- File or Directory’s Access Time (atime)
- File or Directory’s Modify Time (mtime)
The first time, ctime, is the time when changes were last made to the file or directory’s inode. This can include changes to the actual data, file or directory permission changes, file or directory ownership changes, etc. You can view the ctime by using the command ls -lc. The second time, atime, is the time when the file was last accessed. You can view the access times with the command ls -lu. The third time, mtime, is the modify time, which is the time when the actual file contents were last changed. You can view the modify time with the command ls -l. To get all of the information in a quasi-readable format, in Linux you can use the stat command.
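The same three timestamps are also available programmatically. The following is a minimal sketch in Python, assuming a *nix system; the helper names file_ages and format_ages are made up for this illustration and are not part of any tool discussed in this article.

```python
import os
import sys
import time

def file_ages(path):
    """Return the three timestamps of path as seconds since the epoch."""
    st = os.stat(path)
    return {
        "atime": st.st_atime,  # last access
        "mtime": st.st_mtime,  # last change to the file contents
        "ctime": st.st_ctime,  # last inode change on *nix (not creation time)
    }

def format_ages(path):
    """Return one human-readable line per timestamp."""
    return "\n".join(
        f"{name}: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(ts))}"
        for name, ts in file_ages(path).items()
    )

if __name__ == "__main__":
    # Report on the path given as the first argument, or on the current directory.
    print(format_ages(sys.argv[1] if len(sys.argv) > 1 else "."))
```

Run against any file or directory, this prints roughly the same information as the three ls variants above, one line per timestamp.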
This means that when the age of data is discussed on *nix systems, you need to be very specific about which age you are using. One of the most important times to consider is the access time, which tells you the last time the data was used. This information is very useful when determining whether data should be removed and/or archived. However, tracking the change time and the modify time is important as well. So, how does one get all of this information?
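As a rough illustration of what access times can tell you, the sketch below walks a directory tree and totals file sizes by how long ago each file was last accessed. It is not how any particular tool works; the function name bytes_by_access_age and the 30/180/365 day thresholds are arbitrary choices made for this example. Keep in mind that file systems mounted with options such as noatime or relatime do not update atime on every read, so the results are only as good as the recorded access times.

```python
import os
import stat
import time

DAY = 86400  # seconds in a day

def bytes_by_access_age(root, thresholds_days=(30, 180, 365), now=None):
    """Total the sizes of regular files under root, bucketed by days since last access."""
    now = time.time() if now is None else now
    limits = sorted(thresholds_days)
    # One bucket per threshold, plus a final bucket for everything older.
    labels = [f"<{d}d" for d in limits] + [f">={limits[-1]}d"]
    totals = dict.fromkeys(labels, 0)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # count regular files only
            age_days = (now - st.st_atime) / DAY
            for limit, label in zip(limits, labels):
                if age_days < limit:
                    totals[label] += st.st_size
                    break
            else:
                # Older than every threshold.
                totals[labels[-1]] += st.st_size
    return totals

if __name__ == "__main__":
    for label, size in bytes_by_access_age(".").items():
        print(f"{label:>8}: {size} bytes")
```

A walk like this touches every inode on each run, so it is slow on large file systems and gives only a coarse summary.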