Filenames by Design, Part Three

Still more from our series on how to take full advantage of your filesystem with tips and tricks for the newbie and old pro alike.

This column is the third in a series about designing trees of directories and files that help you find data. Because Linux filesystem entries can have almost any character in their names (you can’t use slash or NUL), you can create systems of names that include metadata about the file contents. That makes it easier to find out what’s in a file without needing to read a separate database about the files — or the file itself.

Many of the techniques work on any kind of filesystem tree — not only filesystems with a particular organization. Although we’ll see techniques using shells and utilities, you can also open the files from, say, the menu of a graphical application. Planning ahead at the time you organize your files can make them easier to find and use.

find, your friend

Studying and experimenting with the extremely useful find(1) utility will pay you back many times. (It’s also good to know about the many GNU updates to find.) It’s handy from the command line when you’re trying to locate a particular file. But it’s also great for passing a series of pathnames to utilities, to shell loops, and to scripts in other languages.

Here’s an example: using lpr to print all files with names ending in .txt in each of the subdirectories (or sub-sub-directories…) whose name starts with Denver_07 or Denver_08. (You can enter loops directly at a shell prompt, as we do here. In bash, the secondary prompt > means that the shell is waiting for you to complete a statement.)

$ for dir in $(find . -type d -name 'Denver_0[78]*' -print)
> do
>   cd "$dir" || break
>   lpr *.txt
>   cd -
> done

The || break ends the loop if any cd "$dir" command fails. Many shells understand cd - as “go to the previous directory”. That’s needed here to return to the starting directory because find is outputting relative pathnames (like ./subdir/Denver_08_2006) that start at a certain directory.

If each filename includes metadata about its file, find can use that filename to choose particular files. For instance, the photo filenames at the end of the first article in this series included the dimensions in pixels of the photo. (The file 0012345_01_5248x4100.tif holds a 5248×4100-pixel photo.) To list all photos at least 4000×4000 pixels in size, you could type:

  find . -name '*_[4-9][0-9][0-9][0-9]x[4-9][0-9][0-9][0-9]*' -ls

Tip: Copy the bracket expression [0-9] with your mouse or your editor, then paste it as many times as needed.

Having a well-thought-out syntax for each filename helps you find them reliably. Luckily, it can be easy to rename files within an organized system like this. For instance, see the section “Renaming existing files” in the previous article in this series.

If you need more “finding” power, try the GNU find option -regex. It lets you use regular expressions instead of the simpler shell wildcard patterns shown above.
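For instance, here's the dimension search from above rewritten with -regex — a sketch, keeping in mind that GNU find matches the regular expression against the entire pathname, not just the final component:

```shell
# GNU find: -regex must match the whole pathname (including the leading ./),
# so anchor the dimension pattern with .* on both sides:
find . -regex '.*_[4-9][0-9][0-9][0-9]x[4-9][0-9][0-9][0-9].*' -ls
```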

Recursive Wildcards: zsh

The amazing Z Shell has recursive wildcard operators ** and *** that do a lot of what find does. And the zsh glob qualifiers restrict how wildcards match. Here are three examples.

The for loop above could be rewritten as follows. The wildcard pattern **/Denver_0[78]* matches all pathnames, in the current directory and below, that end with a file or directory whose name starts with Denver_07 or Denver_08:

zsh% for dir in **/Denver_0[78]*
for> do
for>   cd "$dir" || break
for>   lpr *.txt
for>   cd -
for> done

(Z shell secondary prompts name the incomplete command(s) they’re waiting for — in this case, the for loop.)

If any non-directories in the tree might have a name matching Denver_0[78]*, you could add the glob qualifier (/) to match only directories:

zsh% for dir in **/Denver_0[78]*(/)

These recursive wildcards are handy when you know the exact name of a file but you don’t know what directory it’s in. You can even use them as the destination argument to a command. Let’s say you have a file named report123.doc in some directory. You’d like to overwrite it with a copy of the file report123_new.doc from the current directory, while keeping its current name report123.doc. Here’s how — using the cp option -v to show the source and destination pathnames:

zsh% cp -v report123_new.doc **/report123.doc
`report123_new.doc' -> `reports/a/1/report123.doc'

Searching by parsing

When find and shell wildcards aren’t enough, try splitting a filename into its parts. For example, say you want to find all horizontal photos in the current directory. The directory has mixed contents, but all photos are in files whose names end with .jpg or .tif. Use ls to get a list of filenames, sed to parse the width and height from each name, and the shell’s built-in arithmetic comparison to find the files whose horizontal dimension is larger than the vertical. (All filenames have a non-numeric character after the vertical dimension.)

ls *jpg *tif |
sed 's/\(.*_\)\([1-9][0-9]*\)x\([1-9][0-9]*\)\(.*\)/\2 \3 \1\2x\3\4/' |
while read -r width height filename
do
  if [[ $width -gt $height ]]
  then echo "$filename"
  fi
done

The sed s command reads each filename, then writes the width, height, and the filename on its standard output. The shell’s read command reads the first word (up to the first space) into the shell variable $width, the second word into $height, and the rest of the line into $filename. A sample line of sed output might be:

  5248 4100 0012345_01_5248x4100.tif

Once you’re familiar with these sed s/// commands, they’re actually quick and easy to type. (Your shell’s command-line editing can help.)

  • The sed operator pair \( and \) lets you “remember” parts of the input text between the first two slashes in s/// and “replay” those parts on the output (between the last two slashes). The first part becomes available from the special escape \1, the second part from \2, and so on.
  • So the first part of the filename, before the dimensions, can be replayed from \1; the width (the number before the x) is in \2, the height in \3, and the rest of the filename in \4.
  • The replacement side of the substitute command outputs the width, a space, the height, a space, and the entire filename — reconstructed from \1\2x\3\4. (In this case, the last pair of escaped parentheses and the \4 actually aren’t needed because, without them, sed would output that text from the end of the line unchanged.)

Of course, you could do something other than echoing the matching filenames. And there are other ways to parse filenames — including using other scripting languages.
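For instance, the shell can do the same parsing itself with parameter expansion, no sed needed — a sketch that assumes the same name syntax as above (dimensions after the last underscore):

```shell
# Parse width and height with shell parameter expansion instead of sed.
# Assumes names like 0012345_01_5248x4100.tif (dimensions after the last "_").
for f in *_[1-9]*x[1-9]*.jpg *_[1-9]*x[1-9]*.tif
do
  [ -e "$f" ] || continue      # skip patterns that matched nothing
  dims=${f%.*}                 # strip the extension: 0012345_01_5248x4100
  dims=${dims##*_}             # keep text after the last "_": 5248x4100
  width=${dims%x*}
  height=${dims#*x}
  if [ "$width" -gt "$height" ]
  then echo "$f"
  fi
done
```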

From tree to tree

Parallel directory trees with the same structure can be useful. For instance, in the first article of this series, the structure in Figure One has parallel trees rooted at the directories archive, browsing and current.

If you need a duplicate tree, you can create the tree structure by copying the directories only. Here’s one way, using find to find the directories and write their relative pathnames to xargs, which runs mkdir as many times as needed. The old Unix trick of piping into a subshell (the parenthesis operators) means that, while find is outputting pathnames from underneath olddir, the xargs and mkdir programs are running in the directory ../newdir, getting pathnames through the pipe:

mkdir newdir
cd olddir
find * -type d -print | (cd ../newdir && xargs mkdir -pv)

We’re using * with find (instead of the more usual . — which is the current directory) to skip any subdirectories of olddir whose name starts with a dot. (By default, wildcards don’t match those “hidden” directory entries.) The * also gives find “clean” directory names that don’t start with ./. (There’s nothing wrong with a command like mkdir -p ./a/b, but mkdir -p a/b is just “neater”.) By the way, && runs xargs only if the cd succeeded… which prevents copying the directory tree on top of itself if the destination directory ../newdir doesn’t exist.

Filling parallel trees with related files is also easy to do. For instance, to read a list of files in a subdirectory of the current tree, then do an operation on the identical filenames in the browsing tree, a loop like this can do the job:

  for f in `ls current/01/200`
  do something browsing/01/200/$f
  done

The ls command outputs a list of filenames: 01200_03 01201_01 and so on. Then the something command receives pathnames one by one, like browsing/01/200/01200_03 and browsing/01/200/01201_01. A different loop structure could do something else.
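For instance, with cp standing in for the something command, a loop like this copies each file onto its namesake in the parallel tree — a sketch assuming both trees already exist:

```shell
# Fill the parallel browsing tree: copy each file from the current tree
# onto the same filename under browsing (cp stands in for "something"):
for f in `ls current/01/200`
do
  cp -v current/01/200/"$f" browsing/01/200/"$f"
done
```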

Joining forces with a database

Depending on how much metadata you have about a file, and how long you want the filename to get, you may not want to keep all metadata in the filename. That’s when a database — for instance, a flat file or a relational database — can make sense.

If the data you need to access quickly is stored within the files themselves — for example, the EXIF and IPTC data that’s kept in many digital photo files — consider building a quick-access index file periodically. You could run a cron job late at night, when the system isn’t busy, to read the photo files and write the data you’ll need into index file(s). Linux data files commonly use TAB-separated fields and newline-delimited records; also, utilities for sorting and parsing data files often default to those separator characters. (It’s easy to choose different characters, though.) For example, the first field in an index file might be the file’s directory pathname, the second the filename, the third could contain some sorting token such as the date (from EXIF data) that the photo was created…
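A minimal index-builder sketch along those lines — here the file’s modification time (via GNU date -r) stands in for the EXIF creation date, since reading EXIF needs an extra tool, and the photos directory and photo-index.txt names are just examples:

```shell
# Build a TAB-separated index: directory, filename, date.
# The date is the file's mtime (GNU date -r), a stand-in for EXIF data.
find photos -name '*.jpg' -print | while IFS= read -r path
do
  printf '%s\t%s\t%s\n' \
    "$(dirname "$path")" "$(basename "$path")" "$(date -r "$path" +%Y-%m-%d)"
done > photo-index.txt
```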

If you’re indexing a huge collection of files, building the index can take hours. Consider using multiple index files and updating only the files that need changing. Using find with tests like -mtime, -ctime, -newer, and others, can help you find recently-changed files that need indexing. If you’re sorting data, be sure that your system’s temporary file directory (/tmp, or the directory named in the environment variable TMPDIR) has enough room. If it might not, here are two ways to set another directory while sort runs:

TMPDIR=/some/directory sort ...
sort --temporary-directory=/some/directory ...
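A sketch of the incremental indexing mentioned above, using find’s -newer test against a timestamp file (the name last-run is arbitrary, and the file is assumed to have been touched by the previous run):

```shell
# List only photos modified since the previous indexing run,
# then update the timestamp file for next time:
find photos -name '*.jpg' -newer last-run -print
touch last-run
```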

Handy utilities for accessing your index files include:

  • locate(1) and look(1)
  • grep(1) using regular expressions that match particular fields, with matching lines possibly piped to field-handling utilities like cut(1) and paste(1)
  • join(1) is designed to do relational database-type joins. You can use it to merge two or more data sources — for instance, to combine a list of filenames with another file which has the same filenames.
  • Of course, other scripting languages also have powerful features for handling data.
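For instance, join can merge a filename list with a tag file — a sketch assuming two hypothetical TAB-separated files, both sorted on their first (filename) field:

```shell
# names.txt: filename TAB caption;  tags.txt: filename TAB tags.
# Both files must be sorted on field 1 for join(1) to pair them up:
join -t "$(printf '\t')" names.txt tags.txt
```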

Finally, you can export and import data from spreadsheets (like OpenOffice.org Calc) in formats that are easy to use with files and utilities. Look for a format that uses TAB-separated fields; it’s an easy choice as long as none of your data includes TAB characters.

Comments on "Filenames by Design, Part Three"


Nice series, though I don’t know how many of us are using zsh. I’m sure many of us are struggling with a large and growing pile of digital photos. I generally batch rename based on exif info and file the pics into directories based on date before archiving.

My issue is tagging. I’d like to know which pictures ‘Bob’ and ‘Carol’ are in, or which pictures were taken in Italy, or which pictures were from my son’s birthday party. This is where I see a DB coming in handy. A flat file is also fine, perhaps one in each directory containing a line for each photo consisting of the filename followed by a set of tags or keywords. If I want to find the pics with ‘Bob’ in them, just iterate through the directories, grep through the tag file in each, and output the filenames that match. Certainly not as quick as a DB, but easy to maintain.

I can see this growing into a nice little set of scripts that run reports and tell you which pictures have yet to be tagged, generates a list of currently used tags and their frequencies, etc. It’d be cool to be able to put a copy of all the photos I’ve taken of my sister in law on a Sunday between 1pm and 3pm into a directory with one command.

Actually going through my entire collection and creating this tag DB is another story.


I gave up on bash long ago, in favor of zsh. First got acquainted with it in 4.2 BSD, and now I use it on all my UNIX boxes. Just too many uber-cool features to ignore.
