Wizard Boot Camp, Part 10: Utilities You Should Know

The time has come to leave Hogwarts, young wizard! We wrap up our ten-part series on becoming a command-line wizard with a look at more utilities you should know.

This is the third and final article in the series about utility programs that you should know about, along with some not-so-obvious ways to use them.

csplit, split

Some large files — including archives, multipart email messages, business information for a large set of customers, and data files with repeating patterns — may need to be split into smaller “chunks” for reorganization, easier storage or transport. That’s what csplit(1) is for. Give it a pattern, an offset, a repetition count, and/or a line number, and it will parse an input file into a series of smaller output files.

By default, csplit’s output files are named xx00, xx01, and so on. To rearrange a big file, you can simply cat those output files together in a different order. Here’s a simple example using the shell’s curly-brace operators:

$ csplit oldfile '/pattern/' '{4}'
  ...creates files xx00 - xx05...
$ cat xx0{5,0,2,1,3,4} > newfile
$ rm xx0[0-5]

Using a shell wildcard, like xx0[502134], wouldn’t work here: the shell expands a wildcard into filenames sorted in alphanumeric order, so cat would receive the files in their original order. We’ll see more about csplit in a future column, part of a new series about handling text files.
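As a self-contained sketch of the split-and-reorder idea, the script below builds a small sample file in a temporary directory, splits it at each marker line, and reassembles the pieces in a different order. The file contents, the "=== " marker, and the part. prefix are made up for illustration. The sample deliberately starts with a non-matching header line, so the first output file is not empty.

```shell
#!/bin/sh
# Hypothetical sample data: a header line, then three sections that each
# start with a "=== " marker line.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'header' '=== A' 'alpha' '=== B' 'bravo' '=== C' 'charlie' > oldfile

# Split before each marker line. The pattern is applied once, and '{2}'
# repeats it two more times: three split points, so four output files.
# -s suppresses the byte counts; -f sets the output filename prefix.
csplit -s -f part. oldfile '/^=== /' '{2}'
ls part.*

# Reassemble with the last section moved up to follow the header.
cat part.00 part.03 part.01 part.02 > newfile
rm part.0[0-3]
cat newfile
```

Note the counting: a repetition count of N applies the pattern N+1 times in total, which yields N+2 output files when every application finds a match.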

In these days of multi-gigabyte files and terabyte (or larger) filesystems, a file can still be “too big.” For instance, if you’re trying to send a huge file over a network connection that often times out, even the automatic-retry ability of some file transfer utilities won’t get all of the file through.

Splitting the file into smaller chunks, then reassembling the pieces on the other end, can save headaches and time. The split utility is great for that. It splits input — either files or stdin — into equal-sized files named (by default) xaa, xab, xac, and so on. (The last file may be smaller.) For instance, on the sending side, split the file into 1-Megabyte chunks:

$ ls -l kcpr.mp3
-rw-r--r-- ... 70535208 ... kcpr.mp3
$ split -b1m kcpr.mp3
$ ls -l x??
-rw-r--r-- ...  1048576 ... xaa
-rw-r--r-- ...  1048576 ... xab
...
-rw-r--r-- ...   280616 ... xcp

After you’ve transmitted all of the files, do a quick check on the receiving side to be sure each one has the same size it had on the sending side. Then cat them together:

% ls -l x??
-rw-r--r-- ...  1048576 ... xaa
...
-rw-r--r-- ...   280616 ... xcp
% cat x?? > kcpr.mp3
% rm x??

Using a checksum program like md5sum on both the original file and the reconstructed version can give you more confidence.
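Here is a minimal sketch of the whole round trip on one machine, assuming md5sum is installed. The names big.bin, rebuilt.bin, and the chunk. prefix are hypothetical, and the small sizes are chosen only to keep the demo quick. In real use, the two checksums would be computed on different hosts and compared by eye or by script.

```shell
#!/bin/sh
# Hypothetical names: big.bin, the chunk. prefix, and rebuilt.bin.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Make a 5000-byte sample file of random data.
dd if=/dev/urandom of=big.bin bs=1000 count=5 2>/dev/null

# "Sending side": record a checksum, then split into 2048-byte chunks
# named chunk.aa, chunk.ab, and so on.
sum_orig=$(md5sum < big.bin)
split -b 2048 big.bin chunk.
ls -l chunk.*

# "Receiving side": the shell expands chunk.* in sorted order, which is
# the order split created the chunks in, so a plain cat reassembles them.
cat chunk.* > rebuilt.bin
sum_new=$(md5sum < rebuilt.bin)

if [ "$sum_orig" = "$sum_new" ]; then
    echo "checksums match"
else
    echo "checksums differ" >&2
fi
```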

file

In general, Linux doesn’t require filename extensions such as .exe or .txt to know what to do with a file. Executable files (files whose execute bit was set by, for instance, chmod) start with a magic number: a short, fixed byte sequence that identifies the file’s format. The best-known magic number is probably the two-byte #!, which lets you specify the file’s interpreter (/bin/sh, /usr/bin/perl, etc.).

The file utility will guess what type of data is in many types of files. That includes unidentified or mis-identified files you receive attached to an email message. For example:

$ file mystery-file.dat
mystery-file.dat: PDF document, version 1.3

locate, updatedb

If your system runs the updatedb utility (from cron or otherwise), you’ll have a database on your system that lets you find files or directories by name much faster than a command like find / -name .... The locate utility searches that database. For instance, to find all files and directories whose name includes examples:

$ locate examples
/usr/bin/dh_installexamples
/doc/adduser/examples/INSTALL
/doc/zsh/compctl-examples.gz
...

The locate manpage has details — including how to use wildcard characters to restrict matching. But some uses aren’t quite so obvious. If you think about the syntax of the pathnames you want, you can often use string-matching to find them. To locate any file in a directory named examples, for instance, use a search pattern that matches a directory name in a pathname:

locate /examples/

To do more sophisticated pattern-matching, you can filter locate’s output through a tool like grep. You can even dump the entire database:

locate / | grep ...

To search the contents of files or directories with a certain name, pipe the list of pathnames to xargs grep. For instance, to search all files with a name including foo for text containing bar, try this:

$ locate -0 foo | xargs -0 grep -Hs bar | cat -v
/usr/local/bin/ascript:echo "back at the bar..."
/usr/share/doc/m4/examples/foo:bar
Binary file /usr/share/emacs/21.4/lisp/mail/footnote.elc matches

We’re using the -0 (zero) option for both locate and xargs; it separates pathnames with NUL characters, which avoids problems caused by “special” characters in filenames. The grep option -H tells grep to always output a filename (even if xargs happens to pass only one filename to grep). And the grep option -s keeps grep silent about arguments that are directory files or unreadable files. The cat -v avoids sending “unprintable” characters to your terminal.
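The NUL-separated technique can be tried without a locate database. In this sketch, find -print0 stands in for locate -0 (both emit NUL-terminated pathnames), and the sample tree, with a space in one directory name and one filename, is invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical sample tree; one directory and one filename contain a
# space on purpose.
tree=$(mktemp -d)
mkdir "$tree/my docs"
echo 'back at the bar' > "$tree/my docs/foo notes"
echo 'nothing here'    > "$tree/foo2"
echo 'bar none'        > "$tree/other"

# find -print0 stands in for locate -0 here. xargs -0 splits its input
# on NUL, so the spaces in the pathname do not break it into pieces.
# -H always prints the filename; -s hides errors about unreadable files.
matches=$(find "$tree" -type f -name '*foo*' -print0 | xargs -0 grep -Hs bar)
echo "$matches"
```

Only the file whose name includes foo and whose text contains bar is reported; without -print0 and -0, xargs would split the pathname at its spaces and grep would be handed names that don’t exist.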

You may want to run updatedb multiple times to make more than one locate database: one for all users, one for system files, one for each user’s home directory (readable only by that user), and so on. In that case, you and other users may want to set the LOCATE_PATH environment variable to tell locate which databases to search:

$ grep LOCATE_PATH /etc/profile
LOCATE_PATH="/var/cache/locate/locatedb:/usr/local/locatedb/$USER"
export LOCATE_PATH MANPATH PATH

look

Everyone knows about the grep utilities. Less well-known is look. It searches the first characters on each line of data, like a grep search starting with the anchor character ^ (caret). By default, look does a linear (sequential) search; with its -b option, it uses a binary search instead. A binary search finds matches quickly in a sorted data file, even a very large one.

By default, look searches the system word list. That’s handy for checking spelling. If there are words that start with the argument you type, you’ll see them:

$ time look gas
Ga's
Gascony
Gascony's
gas
gas's
...

real    0m0.088s
user    0m0.065s
sys     0m0.009s
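You can also point look at your own sorted file by naming it after the search string. The sketch below uses a made-up five-word list, runs the equivalent anchored grep search, and, if look is installed, checks that look returns the same lines. Sorting and searching are both done in the C locale so that they agree on ordering.

```shell
#!/bin/sh
# Hypothetical word list, sorted in the C locale so that a binary search
# sees the same ordering that sort produced.
words=$(mktemp)
printf '%s\n' gaze gas gate gasp gash | LC_ALL=C sort > "$words"

# Prefix search with grep: the ^ anchors the match to the start of a line.
by_grep=$(grep '^gas' "$words")
echo "$by_grep"

# The same search with look, if it is installed on this system.
if command -v look >/dev/null 2>&1; then
    by_look=$(LC_ALL=C look gas "$words")
    if [ "$by_look" = "$by_grep" ]; then
        echo "look and grep agree"
    else
        echo "look and grep differ"
    fi
fi
```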

Comments on "Wizard Boot Camp, Part 10: Utilities You Should Know"

dbohl

This is good.

Can you do a Linux System Administrator Tools Boot Camp?

yesman

Bravo, WBC series is tour de force. Encore … how about just one more, Part 10 :)

munguia.carlos

please post a linux security bootcamp , about Linux Backtrack , and linux networking security tools

oysterman

Terrific. What a concise, helpfull article. Please keep them coming and, like dbohl said, a sysadmin tools boot camp would be a huge help! Danke!

sstory

Great Job!

Thanks!

rumo

Simply the best …

Thanks!

sdrake

Thank you, this is a great series and things
like other people have suggested (Admin,security etc..)
Great.
