Got text data? Linux has a variety of tools to format, process, and print it.
Linux data often comes in streams of bytes or lines of text. The October 2004 “Power Tools” column presented some ways to edit data byte-by-byte. This month, let’s look at tools and techniques for slicing and dicing data in words and lines, including grouping arguments with xargs and fmt, other uses of fmt and fold, joining lines with join, and turning lines into tables with column and printf.
There’s a lot to cover this month, so let’s dig in!
Groups of Arguments
Early Unix systems had limited amounts of memory, and so it was easy to “run out of room” on the command-line. For example, in a directory with hundreds of files, a command like
grep "some words" *, where the * wildcard expands into all of the names in the directory, often gave an error like Argument list too long. But even on modern systems, where the limit is much larger, it is still very useful to understand the problem and to be able to work around it.
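The limit still exists today; it is just bigger. On a GNU/Linux system you can ask for it directly (a quick check, assuming getconf is installed, as it is on virtually all distributions):

```shell
# Maximum combined size, in bytes, of the argument list plus the
# environment that exec() will accept on this system.
getconf ARG_MAX
```

On recent Linux kernels this is typically two megabytes or more, which is why the error is rarer than it once was.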
Let’s start with a quick review of how a shell executes a program, passes command-line arguments to it, and handles its standard I/O.
When a shell executes a command line like this one…
$ grep "some words" * > /tmp/grepout
… the shell first parses the command line. The first word is the name of the program to execute; the shell finds the executable program at (in this case) /bin/grep. The quotes around
some words tell the shell not to split the string into two separate arguments at the space (so the first argument to /bin/grep is the string
some words, with the space and without the quotes). The shell then expands
* into the names of all entries in the current directory, then (assuming there aren’t too many) passes them on to grep. The remainder of the command-line,
> /tmp/grepout, redirects the standard output of the grep process to the file /tmp/grepout instead of to the terminal.
Figure One shows the resulting processes. If there are 999 files in the current directory, the shell passes grep 1,000 arguments. The first argument is the pattern to search for and the remaining 999 are filenames. (Actually, grep gets 1,001 arguments; the very first argument is grep's name or its pathname.)
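You can watch the shell do this splitting yourself. In this sketch, printf stands in for grep: each command-line argument it receives is printed on its own line, inside brackets, so the effect of the quotes and the wildcard is plainly visible:

```shell
# The quoted string survives as a single argument; the unquoted
# words become one argument each.
printf '[%s]\n' "some words" foo bar
# → [some words]
#   [foo]
#   [bar]
```

Run it in a directory with a few files and replace foo bar with * to see the wildcard expansion, too.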
Figure One: Executing a simple command-line
Figure Two: Executing a command-line with xargs
If 1,000 arguments is "too many," the shell won't start grep; instead, it prints an error. To solve this, you can run grep more than once, each time with a subset of the names from the current directory, so that all of the names never appear on a single command line at once.
One answer is to use xargs and pass the long list of arguments to its standard input (stdin). Figure Two shows this setup. xargs reads some of the arguments (the first 50, for instance) and executes the command grep "some words" with those 50 filenames. Then xargs reads another 50 arguments from its stdin and executes grep "some words" again with those 50 filenames. The shell has redirected the standard output of xargs to the output file /tmp/grepout; all the subprocesses of xargs (the child grep processes) inherit the same standard output and write their results there in sequence. So the effect is the same as the original command line, albeit with a lot of extra "plumbing." If you aren't familiar with how the shell arranges processes, this setup is worth some study.
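Here is a minimal sketch of that pipeline; the directory and file names are invented for illustration:

```shell
# Create a tiny demo directory (names are made up).
mkdir -p /tmp/xargs-demo
cd /tmp/xargs-demo
echo "some words appear here" > file1
echo "unrelated text"         > file2

# ls writes one filename per line; xargs batches those names onto
# grep command lines, and all the output lands in /tmp/grepout.
ls | xargs grep "some words" > /tmp/grepout
```

One caveat: piping ls into xargs breaks on filenames containing spaces or newlines. For those cases, GNU xargs pairs `find ... -print0` with `xargs -0`, which separates names with NUL bytes instead of whitespace.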
Now, back to xargs itself. By default, xargs reads data and arranges it into an unspecified number of "chunks." You can control how much data xargs reads per chunk with its command-line options -l max-lines (or, with GNU-style long options, --max-lines=max-lines) and -n max-args (--max-args=max-args). There are more options, too, and the manual page lists them. But let's move on to other ways to group data.
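The -n option is easy to see in action. Feed xargs five arguments and cap each invocation at two:

```shell
# Five arguments arrive on stdin; -n 2 caps each echo invocation
# at two arguments, so xargs runs echo three times.
printf 'a\nb\nc\nd\ne\n' | xargs -n 2 echo
# → a b
#   c d
#   e
```

Substitute grep "some words" for echo and you have exactly the batching shown in Figure Two, just with smaller chunks.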
Data Chunking with fmt
Linux has more general tools for reorganizing data, and one is fmt. fmt reads words (separated by spaces and newlines) from its standard input and collects them into lines of data, by default about 75 characters per line. Listing One shows an example of this "wrapping" of text.
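For a quick feel of the behavior before looking at Listing One, here is a minimal sketch (the input text is made up); the -w option sets the target width:

```shell
# Reflow short, ragged input lines into lines no wider than 20
# characters; without -w, fmt aims for roughly 75.
printf 'one two three\nfour\nfive six seven eight nine ten\n' | fmt -w 20
```

fmt joins the short lines into a single stream of words and re-breaks them so that no output line exceeds the requested width.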