Power Tools: Piles of Files

Using text and utilities to organize and access files.

Linux runs on text. Configuration files are often human-readable text. Many other files contain text, too, and text often flows through Standard I/O connections. Linux has powerful utilities to handle text; you can also use a scripting language.

The names of files and their locations (pathnames) are also usually text. So, the techniques you use to process text can also be used to process files.

(Of course, if looking through file listings and clicking on some of them is the best way to find what you want, Linux has GUI browsers like Nautilus and Konqueror.)

This article covers ways to make lists of files — on-the-fly or in another file — then narrow the list to just what you’re looking for. We’ll use lots of shell loops with redirected I/O; if you need an introduction, see the sections “Let a Loop Do The Work” in Great Command-line Combinations.

When the Name Isn’t Enough

The third article in the Filenames by Design series shows ways to find files by name when those files are part of a thoughtfully-designed system. If you’re like me, though, you can only wish that all of your files were in a system that makes everything easy to find. (Some projects are carefully planned. Others are 3 a.m. hacks that you can’t finish neatly before the next crisis hits.)

Attributes like the last-modification timestamp or the size can help you find a file that’s hidden like a needle in a haystack. See the sidebar Some file attributes for suggestions. Of course, attributes aren’t always enough.

One of my favorite quick ways to save files from a project is to make a tar(1) archive in gzip(1) format with a name like project-name_1996-02-15.tar.gz and transfer it into a directory named tarballs on my main system. That’s great if I remember the name of the project or when I worked on it. More likely, though, I’ve forgotten what year it was or what conference I was about to attend when I wrote that file with the example I’m looking for. It’s time for power tools.

(By the way, this is a specific example of a general technique. These ideas also work for single files that aren’t in an archive.)
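
The archiving habit described above can be sketched in a few commands. This is only an illustration: the directory names are placeholders created in a scratch area, and the archive is written straight into a local tarballs directory instead of being transferred from another machine.

```shell
# Sketch: archive a project directory under a dated name.
# "project-name" and the scratch directories are placeholders.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/project-name" "$work_dir/tarballs"
echo 'example notes' > "$work_dir/project-name/README"

# date +%Y-%m-%d gives a sortable stamp in the style of 1996-02-15.
archive="$work_dir/tarballs/project-name_$(date +%Y-%m-%d).tar.gz"

# -c create, -z compress with gzip, -f archive file,
# -C change into work_dir before adding project-name.
tar -czf "$archive" -C "$work_dir" project-name

# List the archive's contents to confirm what was saved.
archive_listing=$(tar -tzf "$archive")
echo "$archive_listing"
```

Putting the date in year-month-day order means a plain ls of the tarballs directory sorts each project's archives chronologically.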

Start by thinking where the data might be — and, once you find some likely spots, what tools could extract it. Here we’re looking for gzipped files. Uncompressing each file onto the disk and searching through it can take a lot of disk space. But the GNU zcat(1) utility (also known as gunzip -c) reads a compressed file in various formats, uncompresses its contents on-the-fly, and writes them to standard output. That lets you avoid temporary files by writing data into a pipe.
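
Here is a minimal sketch of that idea, assuming GNU gzip and grep. The scratch directory, the file name notes.txt, and the search string needle are all made up for the demonstration.

```shell
# Sketch: search a gzipped file without uncompressing it onto the disk.
demo_dir=$(mktemp -d)
printf 'first line\nthe needle is here\nlast line\n' > "$demo_dir/notes.txt"
gzip "$demo_dir/notes.txt"            # leaves only notes.txt.gz

# zcat writes the uncompressed text to standard output; grep reads the pipe
# and counts the matching lines.
zcat_hits=$(zcat "$demo_dir/notes.txt.gz" | grep -c 'needle')
echo "matching lines: $zcat_hits"

# The directory still holds only the compressed file.
ls "$demo_dir"
```

Because the uncompressed data exists only inside the pipe, the search needs no extra disk space however large the file is.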

Some File Attributes

The name and the contents aren’t the only way to find the file you want. A file also has attributes — the last modification time, for instance. You can find many attributes with utilities like ls(1) and stat(1); there are other suggestions below.

Here are some attributes you might want to search for:

  • The filename.
  • The “extension”, like .jpg for a JPEG-format photo.

    (Note that Linux itself doesn’t have actual filename extensions the way Microsoft Windows does. The applications you use may care how a filename ends, what sort of data the file contains and the structure of that data, but Linux doesn’t. The file is just a sequence of data bits. Linux doesn’t regulate whether, say, a JPEG photo is in a file whose name ends with the four characters .jpg. A Power Tools column has more about file “types” under Linux.)

  • Part or all of the file’s pathname (one or more of the names of directories that hold the file).
  • The file’s three timestamps: last modification of the file contents, last “change” (to the file’s metadata, not to the file contents), and last access.
  • The file length.
  • If it’s a text file, the number of lines and/or words. (For instance, you might be looking for files with more than 1,000 lines.)

    The wc(1) utility can count lines and words if the file is plain text (with no non-text coding added by, say, a word processing program). The section “Data Is Just Data” of the Power Tools column Performing Data Surgery explains how Linux text files are structured.

  • Is the file actually a symbolic or hard link?
  • Linux extended file attributes store extra data alongside files, a sort of “tagging” system to let you identify particular files. Not all utilities support attributes, but Z shell does. See the manpages for getfattr(1) and setfattr(1); chattr(1) and lsattr(1) cover the separate per-file flags that some filesystems support.
  • If the file contains particular words, strings, or characters, grep(1) and friends can probably find them.
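
Several of the attributes listed above can be checked from the shell. The following is a rough sketch, not a complete search tool: it assumes GNU stat (for the -c format option), and every file name in it is invented for the demonstration.

```shell
# Sketch: look up a few file attributes in a scratch directory.
attr_dir=$(mktemp -d)
seq 1 1500 > "$attr_dir/long.txt"      # a 1,500-line text file
printf 'short\n' > "$attr_dir/short.txt"
ln -s long.txt "$attr_dir/link.txt"    # a symbolic link

# Length in bytes and last-modification time (GNU stat format sequences).
stat -c '%n: %s bytes, modified %y' "$attr_dir"/*.txt

# Regular files (not symbolic links) with more than 1,000 lines.
long_files=$(
  for f in "$attr_dir"/*; do
    if [ -f "$f" ] && [ ! -L "$f" ]; then
      if [ "$(wc -l < "$f")" -gt 1000 ]; then
        basename "$f"
      fi
    fi
  done
)
echo "more than 1,000 lines: $long_files"

# Symbolic links only.
sym_links=$(find "$attr_dir" -type l -exec basename {} \;)
echo "symbolic links: $sym_links"
```

Redirecting the file into wc (wc -l < "$f") makes it print only the count, with no filename to strip off before the numeric comparison.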

A scripting language with flexible searching can be a good choice for complex tests and searches for non-textual data. (One of those languages is Perl.)

We’ll be searching tar archives. What’s in a tarball? It’s a series of entries, each one a block of metadata for a file followed by that file’s content. We want to find string(s) somewhere in the content of one of those files. A quick-and-dirty technique is to search the entire tarball for the string you’re looking for, filtering the search results to keep non-text characters from messing up the screen. (You may not need tar unless you’re extracting a file from the archive.) Let’s start with that:

$ cd tarballs
$ for file in *1996* *usenix*
> do
>   zcat "$file" |
>   grep -i -H --label="$file" 'pattern'
> done | cat -v
Binary file ora_1996-04-15.tar.gz matches
Binary file usenix_1999.tar.gz matches

  • Wildcarded strings like *1996* *usenix* match all filenames in the directory that include 1996 or usenix.

    If that list might contain duplicates, you could either use a more specific wildcard pattern or start the loop this way:

    for file in $(/bin/ls -d1 *1996* *usenix* | uniq)

    • /bin/ls -d1 (that’s a digit 1) lists the matching filenames, one per line, in sorted order. Using /bin/ls bypasses any alias you might have for ls. The -d option tells ls to list directory names instead of their contents.
    • The uniq utility removes duplicate entries from a sorted list.
  • In the loop, zcat opens each file.
  • The uncompressed tarball is filtered through grep, which does a case-insensitive (-i) search for pattern.
  • Because grep is reading from the pipe, it doesn’t see the tarball’s filename. Adding --label="$file" makes grep output the filename, expanded by the shell from $file. (The --label option seems to also require -H… on grep version 2.5.1, at least.)
  • The loop’s output (actually, the standard output of all of the grep processes in the pipe) is piped to cat -v. This makes sure that your screen won’t turn into mush if the search matches a line containing non-textual data — such as a filename, surrounded by control characters, embedded in a file’s metadata.

    The cat -v trick is a good one. It actually wasn’t needed here, though, because grep decided that the tarballs were “binary” files — that is, the first few bytes were non-textual — so it output “Binary file file matches”. Adding the option --binary-files=text tells grep to show the matching lines anyway. We’ll try that next.
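
To see the --label and cat -v steps in isolation, you can try them on a small stream that grep treats as text. This sketch assumes GNU grep; the label demo.tar.gz and the sample data are made up, and the stream contains a Control-A character (octal 001) but no NUL bytes.

```shell
# Sketch: label matches read from a pipe, then make control characters visible.
label_out=$(
  printf 'Alpha\nneedle\001here\n' |
  grep -i -H --label=demo.tar.gz 'NEEDLE' |
  cat -v
)
echo "$label_out"
```

The matching line comes out prefixed with demo.tar.gz: and cat -v shows the Control-A as the two printable characters ^A, so nothing unprintable reaches the terminal.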
