Slicing and Dicing on the Command Line

If you don't know text, you don't know Linux. There are a host of methods for reformatting plain text -- including the text used by graphical applications like spreadsheets and email programs.

(If your data is in spreadsheet format, the tools built in to the spreadsheet program — including built-in macros and programming — may do the job. But imagine you have 200 spreadsheets, each with 50 columns to rearrange. Assuming that the spreadsheet’s built-in tools can do the job, do you really want to make all of those edits with a mouse and a bunch of dialog boxes? Or would you rather write a little loop from a shell prompt that edits automatically?)

Rearranging Fields with cut

The handy cut(1) utility separates text into fields, then outputs the fields you select. The default field delimiter is TAB; the -d option changes that. List the output fields after the -f option. The field list is one or more numbers, starting at 1; separate individual field numbers with a comma (,) or give a range of fields with a dash (M-N). A range can be incomplete: -N means fields 1-N, and M- means fields M through the last. Here are a few examples with our sample data file:

$ cut -f1,5 data.txt
STATE	GOVT.
AZ	Mayor
CA	Sheriff
TX	Bubba
$ cut -f5,1 data.txt
STATE	GOVT.
AZ	Mayor
CA	Sheriff
TX	Bubba
$ cut -f3- data.txt | cat -vte
COUNTY^IPOP.^IGOVT.$
Gila^I123^IMayor$
Lolo^I345^ISheriff$
El Paso^I22^IBubba$

As you can see, cut ignores the order of field numbers after -f. To output the fields in a different order — for instance, to rearrange columns from a spreadsheet — you need a better tool. One is awk(1). Let’s enter everything from the command line: set the input and output field separators to TAB, then print the input fields 4, 3, 5, 2, and 1:

$ awk -F '\t' -v 'OFS=\t' '{print $4, $3, $5, $2, $1}' data.txt
POP.	COUNTY	GOVT.	CITY	STATE
123	Gila	Mayor	Ely	AZ
345	Lolo	Sheriff	Alma	CA
22	El Paso	Bubba	Leroy	TX

Of course, you can use your favorite scripting language (if it isn’t awk)… and that includes a shell. Here are two bash scripts — which you could also enter from the command line if you’d rather. They both set IFS, the shell’s internal field separator, to a TAB character while using read to read a line of data from the standard input. The -a option writes the fields into the array $f. Then echo outputs the fields in the order we want.

Listing One: Reordering fields with bash, version 1

#!/bin/bash
# Set $IFS to TAB only during "read":
while IFS="     " read -r -a f
do echo -e "${f[3]}\t${f[2]}\t${f[4]}\t${f[1]}\t${f[0]}"
done

The script in Listing One has a couple of problems. One is that it uses echo -e to interpret the \t (TAB) escape sequences. If any of the data fields contain a backslash (\), too, echo may interpret those — and corrupt the data. The other problem is that, if your data has a lot of fields, passing a long list of arguments to echo is ugly and tedious.

The version in Listing Two uses nested loops. The inner for loop steps through all field numbers except the last, using echo -En to print the field contents and a TAB — without any backslash interpretation, and without a final newline. After the for loop finishes, a final echo -E outputs the last field and a newline:

Listing Two: Reordering fields with bash, version 2

#!/bin/bash
# Set $IFS to TAB only during "read":
while IFS="     " read -r -a field
do
  for n in 3 2 4 1
  do
    # Embedded TAB at end of string:
    echo -En "${field[n]}     "
  done
  echo -E "${field[0]}"
done

All three scripts above show a specific example of a general technique for rearranging data on a line: split the line into pieces (at TABs or some other place), then output those pieces in the order you want.

Modifying Some Lines

The techniques in this section are obvious, but worth mentioning. How can you output only certain input lines?

  1. If you can write a regular expression to match the lines you want, try grep or egrep.
  2. Using sed -n tells sed to output only the lines you choose (instead of the default, which is to output all lines). For instance:
    • sed -n '13,$p' outputs lines 13 through the end of input.
    • sed -n '/start/,/end/' outputs all lines between one containing start and one containing end.
    • sed -n '10p;22p;93,95p' outputs lines 10, 22, 93, 94, and 95.
    • Of course, you can do more. Making a sed script file is sometimes simpler than typing long expressions at a shell prompt.
  3. Use a scripting language with a patern match. For example, here’s an awk expression that outputs a record only if the second field contains this:

    $2 ~ /this/ {print $4, $3, $5, $2, $1}

Reordering Lines

Scripting languages can also reorder lines (or, in database terminology, records). (Because the lines need to be buffered or read from a temporary file, not all utilities are designed to reorder lines.) The GNU version of sort has a lot of ways to sort data — including random order (!) with sort -R.

Modern versions of sort use the -k option to specify sort fields. But how can you sort on the last word (field) in a line, if lines don’t all have the same number of words? Here’s a trick: copy the last word to the start of the line, sort on the first field in that temporary line, then strip off the first field. Watch:

$ cat salaries
Yvette van der Hoff 100000
Barack Obama 400000
Bernie Madoff 0
John Q. Public 20000
$ awk '{print $NF, $0}' salaries | sort -nr -k1,1 | cut -d" " -f2-
Barack Obama 400000
Yvette van der Hoff 100000
John Q. Public 20000
Bernie Madoff 0

Re-flowing Text with fmt

The fmt utility reads words, separated by spaces, and outputs them reformatted into lines of approximately equal width. There’s an introduction in the section “Data Chunking with fmt” of the article More Data Surgery.

A not-so-obvious use of fmt is converting a paragraph from a group of newline-terminated lines (with “hard” line breaks) into a single long, wrapped line (“soft” line breaks). This can be very handy with copy-and-paste from one program to another. For instance, an email message often appears as a bunch of newline-terminated lines. If you copy and paste those “jagged” lines into another window, such as a word processor, the lines won’t flow smoothly into a paragraph. Figure Two shows how to fix this with fmt.

Figure Two: Flowing lines into paragraphs with fmt
Figure Two: Flowing lines into paragraphs with fmt

The command cat > email reads its standard input and writes it to the file email. Using fmt -w 2000 outputs that file with lines 2,000 characters long (2000 is an arbitrary large number). Copy the fmt output with your mouse and paste it into the other GUI. (Note that fmt normally outputs two spaces at the end of a sentence. If you want a single space instead, pipe fmt‘s output to tr -s ' '.)

Two more notes before we wrap this up: GNU fmt has several options to control formatting. There are alternatives to fmt, too, including scripts in languages like Perl that handle formatting the way you want it.

Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/linux-mag.com/site/www/htdocs/wp-content/themes/linuxmag/single.php on line 62