Using text and utilities to organize and access files.
Searching Multiple Directories
What if the archives you’re searching are spread around the filesystem instead of in a single directory? You could feed the loop with a recursive file search:
$ find . -type f -name '*.tar.gz' -print |
> while read -r file
> do zcat "$file" |
> grep -i -H --label="$file" --binary-files=text 'pattern'
> done | cat -v
./tarballs/ora_1996-04-15.tar.gz:<H1>Patterning Yourself
./tarballs/usenix_1999.tar.gz:and pattern too
There, find outputs file pathnames, one per line. Those pathnames are piped to a while loop where read -r makes each pathname available in $file. The loop iterates until read runs out of pathnames.
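One caveat: with a single variable, read -r keeps spaces inside a pathname, but names with leading whitespace or embedded newlines can still confuse the loop. If your tree might hold such names, here’s a safer sketch that delimits pathnames with NUL characters instead (-print0 is in GNU and BSD find; read -d '' needs bash):
$ find . -type f -name '*.tar.gz' -print0 |
> while IFS= read -r -d '' file
> do zcat "$file" |
> grep -i -H --label="$file" --binary-files=text 'pattern'
> done | cat -v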
That search might be overly broad, leaving you looking through lots of results. You could try redirecting the loop output to a file, then paging through the file with less(1):
...
> done | cat -v > /tmp/filesearch-output
$ less /tmp/filesearch-output
...
Or simply pipe the output of the loop to less:
...
> done | cat -v | less
Within less, you can type /pattern to search for the pattern that grep has found. Each occurrence of the pattern will be highlighted so it’s easier to spot.
Edited Search Lists
The previous example used programs to narrow the search. There are times when it’s better to narrow the list of files by hand, using a tool like a text editor to choose the filenames.
(As mentioned earlier, this technique isn’t just useful for tarballs. For example, if you’re searching for images, a tool like ImageMagick identify -verbose could do the trick. It sends image metadata and comments down a pipe or into a file where you can search for the image you want.)
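Here’s a minimal sketch of that image-metadata idea, assuming ImageMagick is installed; 'caption text' is a stand-in pattern for whatever metadata you’re hunting:
$ find . -type f -name '*.jpg' -print |
> while read -r file
> do identify -verbose "$file" |
> grep -i -H --label="$file" 'caption text'
> done | less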
The file filesearch-output in the previous section contains a colon-delimited list of filenames and search results. So:
- Edit that file with your favorite editor to leave only the lines with likely results:
./tarballs/ora_1996-04-15.tar.gz:<H1>Patterning Yourself
./tarballs/usenix_1999.tar.gz:and pattern too
- Clean up the search results with a global substitution that leaves just the pathnames on each line. That is, remove the colon and all the text after it. (You could remove each leading ./, too; it’s redundant. If you’d rather script this step than do it in an editor, see the sed sketch after this list.)
tarballs/ora_1996-04-15.tar.gz
tarballs/usenix_1999.tar.gz
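Here’s that sed sketch; it makes both substitutions at once. It assumes GNU sed, whose -i option edits the file in place, and that the pathnames themselves contain no colons:
$ sed -i -e 's/:.*//' -e 's|^\./||' /tmp/filesearch-output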
Now you have a file containing just pathnames. Use it to “drill down” to the information you want. For instance, get a list of the contents of each tarball and page through the listings with less:
$ while read -r file
> do
> echo "===== $file"
> tar -tzvf "$file"
> done < /tmp/filesearch-output | less
That loop reads filenames one-by-one from the file /tmp/filesearch-output (which you edited to contain just the likely filenames).
It outputs a title line that starts with five equal signs, followed by tar’s verbose listing of the tarball contents. You’ll see something like this:
===== tarballs/ora_1996-04-15.tar.gz
-rw-r--r-- jpeek/users 5250 2003-06-05 10:28 PID_list
-rw------- jpeek/users 4106 2004-09-29 11:16 UPT_Russian.jpg
...
===== tarballs/usenix_1999.tar.gz
drwx------ jp123/staff 0 2007-02-17 13:01 difftest/
drwx------ jp123/staff 0 2007-02-17 12:48 difftest/2/
-rw------- jp123/staff 6 2007-02-17 12:41 difftest/2/file1
...
If there’s still too much text, consider how you can filter the tar output. For instance, to skip the directories in each tarball listing, grep -v can omit lines that start with d:
$ while read -r file
> do
> echo "===== $file"
> tar -tzvf "$file" | grep -v "^d"
> done < /tmp/filesearch-output | less
(The loop runs two commands for each value of $file. First, echo writes a title line to stdout. Second, tar lists the archive contents; grep removes the lines for directories and writes the other lines to stdout. After done, a pipe collects the stdout from both command lines within the loop and less pages through it.)
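Other filters drop into the same slot. For example, here’s a hedged awk sketch that skips directories and any entry smaller than 1,000 bytes; it assumes GNU tar’s listing format, where the first field holds the permissions and the third field the size:
$ while read -r file
> do
> echo "===== $file"
> tar -tzvf "$file" | awk '$1 !~ /^d/ && $3 > 1000'
> done < /tmp/filesearch-output | less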
Just a Start…
What we’ve seen here is a general technique: use whatever tools, automated or manual, that drill down quickly to the results you want. Shell wildcards and utilities like find will help you get lists of filenames. Other utilities look inside those files to extract the data and test it.
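For instance, if you use bash 4 or later, its globstar option makes ** match recursively, so simple cases don’t need find at all. A sketch of the first loop rewritten that way:
$ shopt -s globstar
$ for file in **/*.tar.gz
> do zcat "$file" |
> grep -i -H --label="$file" --binary-files=text 'pattern'
> done | cat -v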
(By the way, if you aren’t familiar with the shell’s command-line editing, it’s worth learning. It will save you a lot of retyping.)
Comments on "Power Tools: Piles of Files"
Given this is Linux *magazine*, I’d really like to see a printable-view or PDF version of articles made available.
Something shining shows up in your articles every time. You’ve made the dry mass of man pages interesting with your wisdom.
Good job, Jerry.
In the Windows world, I couldn’t live without Google Desktop.
There are alternatives for Linux users, if you’re primarily concerned with text. Tracker and similar tools index files as they are created. Web-based search indexing tools such as glimpse and harvest can also be used.
Why open every file every time you look? :-)
Thanks for mentioning Tracker. Other recommendations, anyone?
By the way, I meant this article to show the general idea of hunting through a directory tree with loops and command-line utilities — rather than to show the “best” ways. But, while I’m at it ;-)… I should also have mentioned the locate, updatedb and look utilities covered in Wizard Boot Camp, Part Ten.
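For completeness, a quick sketch of that first pair (updatedb typically needs root privileges and is often run nightly by cron):
$ sudo updatedb      # rebuild the pathname database
$ locate '*.tar.gz'  # query the database; no files are opened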