Wizard Boot Camp, Part 10: Utilities You Should Know

The time has come to leave Hogwarts, young wizard! We wrap up our ten-part series on becoming a command-line wizard with a look at more utilities you should know.

We wrap up the Wizard Boot Camp series with a third and final article about utility programs that you should know about — and some not-so-obvious ways to use them.

csplit, split

Some large files — including archives, multipart email messages, business information for a large set of customers, and data files with repeating patterns — may need to be split into smaller “chunks” for reorganization, easier storage or transport. That’s what csplit(1) is for. Give it a pattern, an offset, a repetition count, and/or a line number, and it will parse an input file into a series of smaller output files.

By default, csplit‘s output files are named xx00, xx01, and so on. To rearrange a big file, you can simply cat those output files together in a different order. Here’s a simple example using the shells’ curly-brace operators:

$ csplit oldfile '/pattern/' '{5}'
  ...creates files xx00 - xx05...
$ cat xx0{5,0,2,1,3,4} > newfile
$ rm xx0[0-5]

Using a shell wildcard operator, like xx0[502134], wouldn’t work with cat because it sorts filenames in alphanumeric order. We’ll see more about csplit in a future column — part of a new series about handling text files.

In these days of multi-gigabyte files and terabyte (or larger) filesystems, a file can still be “too big.” For instance, if you’re trying to send a huge file over a network connection that often times out, even the automatic-retry ability of some file transfer utilities won’t get all of the file through.

Splitting the file into smaller chunks, then reassembling the pieces on the other end, can save headaches and time. The split utility is great for that. It splits input — either files or stdin — into equal-sized files named (by default) xaa, xab, xac, and so on. (The last file may be smaller.) For instance, on the sending side, split the file into 1-Megabyte chunks:

$ ls -l kcpr.mp3
-rw-r--r-- ... 70535208 ... kcpr.mp3
$ split -b1m kcpr.mp3
$ ls -l x??
-rw-r--r-- ...  1048576 ... xaa
-rw-r--r-- ...  1048576 ... xab
-rw-r--r-- ...   280616 ... xcp

After you’ve transmitted all of the files, do a quick check on the receiving side to be sure they all have the same size. Then cat them together.

% ls -l x??
-rw-r--r-- ...  1048576 ... xaa
-rw-r--r-- ...   280616 ... xcp
% cat x?? > kcpr.mp3
% rm x??

Using a checksum program like md5sum on both the original file and the reconstructed version can give you more confidence.


In general, Linux doesn’t require filename extensions such as .exe or .txt to know what to do with a file. Executable files — files whose execute bit was set by, for instance, chmod — start with a two-byte magic number. The best-known magic number is probably #!, which lets you specify the file’s interpreter (/bin/sh, /usr/bin/perl, etc.)

The file utility will guess what type of data is in many types of files. That includes unidentified or mis-identified files you receive attached to an email message. For example:

$ file mystery-file.dat
mystery-file.dat: PDF document, version 1.3

locate, updatedb

If your system runs the updatedb utility (from cron or otherwise), you’ll have a database on your system that lets you find files or directories by name much faster than a command like find / -name .... The locate utility searches that database. For instance, to find all files and directories whose name includes examples:

$ locate examples

The locate manpage has details — including how to use wildcard characters to restrict matching. But some uses aren’t quite so obvious. If you think about the syntax of the pathnames you want, you can often use string-matching to find them. To locate any file in a directory named examples, for instance, use a search pattern that matches a directory name in a pathname:

locate /examples/

To do more sophisticated pattern-matching, you can filter locate‘s output through a tool like grep. You can even dump the entire database:

locate / | grep ...

To search the contents of files or directories with a certain name, pipe the list of pathnames to xargs grep. For instance, to search all files with a name including foo for text containing bar, try this:

$ locate -0 foo | xargs -0 grep -Hs bar | cat -v
/usr/local/bin/ascript:echo "back at the bar..."
Binary file /usr/share/emacs/21.4/lisp/mail/footnote.elc matches

We’re using the -0 (zero) option for both locate and xargs; it separates pathnames with NUL characters, which avoids problems caused by “special” characters in filenames. The grep option -H makes tells grep to always output a filename (even if xargs happens to pass only one filename to grep). And the grep option -s keeps grep silent about arguments that are directory files or unreadable files. The cat -v avoids sending “unprintable” characters to your terminal.

You may want to run updatedb multiple times to make more than one locate database: one for all users, one for system files, one for each user’s home directory (readable only by that user), and so on. In that case, you and other users may want to set the LOCATE_PATH environment variable to tell locate which databases to search:

$ grep LOCATE_PATH /etc/profile


Everyone knows about the grep utilities. Less well-known is look. It searches the first characters on each line of data — like a grep search starting with the anchor character ^ (caret). look defaults to a linear (sequential) search or uses binary search with its -b option. A binary search rapidly searches a sorted data file — even a very large one.

By default, look searches the system word list. That’s handy for checking spelling. If there are words that start with the argument you type, you’ll see them:

$ time look gas

real    0m0.088s
user    0m0.065s
sys     0m0.009s

Comments on "Wizard Boot Camp, Part 10: Utilities You Should Know"

Hi there to all, the contents existing at this website
are actually amazing for people knowledge, well,
keep up the nice work fellows.

Heey There. I found your blog using msn. This is a
very well written article. I will be sure to bookmark it
and come back to read more of your useful information. Thanks for the post.
I’ll definitely comeback.

Also visit my website – cialis review

Although web-sites we backlink to beneath are considerably not associated to ours, we really feel they are really really worth a go via, so possess a look.

Leave a Reply