Slicing and Dicing on the Command Line

If you don't know text, you don't know Linux. There are a host of methods for reformatting plain text -- including the text used by graphical applications like spreadsheets and email programs.

(If your data is in spreadsheet format, the tools built in to the spreadsheet program — including built-in macros and programming — may do the job. But imagine you have 200 spreadsheets, each with 50 columns to rearrange. Assuming that the spreadsheet’s built-in tools can do the job, do you really want to make all of those edits with a mouse and a bunch of dialog boxes? Or would you rather write a little loop from a shell prompt that edits automatically?)

Rearranging Fields with cut

The handy cut(1) utility separates text into fields, then outputs the fields you select. The default field delimiter is TAB; the -d option changes that. List the output fields after the -f option. The field list is one or more numbers, starting at 1; separate individual field numbers with a comma (,) or give a range of fields with a dash (M-N). A range can be incomplete: -N means fields 1-N, and M- means fields M through the last. Here are a few examples with our sample data file:

$ cut -f1,5 data.txt
STATE	GOVT.
AZ	Mayor
CA	Sheriff
TX	Bubba
$ cut -f5,1 data.txt
STATE	GOVT.
AZ	Mayor
CA	Sheriff
TX	Bubba
$ cut -f3- data.txt | cat -vte
COUNTY^IPOP.^IGOVT.$
Gila^I123^IMayor$
Lolo^I345^ISheriff$
El Paso^I22^IBubba$
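
The -d option and the open-ended ranges described above can be tried without the sample file. What follows is only a sketch: the colon-delimited records and the helper name sample_records are made up for illustration and are not part of data.txt.

```shell
#!/bin/bash
# Hypothetical colon-delimited records (not the article's data.txt).
sample_records() {
  printf '%s\n' 'alice:x:1001:/home/alice:/bin/bash' \
                'bob:x:1002:/home/bob:/bin/sh'
}

# -d sets the delimiter; -f takes single fields and ranges.
sample_records | cut -d: -f1,5     # first and fifth fields
sample_records | cut -d: -f5,1     # same output: the list order is ignored
sample_records | cut -d: -f-2      # fields 1 through 2
sample_records | cut -d: -f4-      # field 4 through the last
```

Note that cut reuses the input delimiter (here a colon) between the fields it prints.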

As you can see, cut ignores the order of field numbers after -f. To output the fields in a different order — for instance, to rearrange columns from a spreadsheet — you need a better tool. One is awk(1). Let’s enter everything from the command line: set the input and output field separators to TAB, then print the input fields 4, 3, 5, 2, and 1:

$ awk -F '\t' -v 'OFS=\t' '{print $4, $3, $5, $2, $1}' data.txt
POP.	COUNTY	GOVT.	CITY	STATE
123	Gila	Mayor	Ely	AZ
345	Lolo	Sheriff	Alma	CA
22	El Paso	Bubba	Leroy	TX
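
Once the awk command works on one file, a shell loop can apply it to many, which is the 200-spreadsheet scenario from the introduction. Here is a minimal sketch, assuming the spreadsheets have already been exported as tab-separated files. The function names reorder_columns and reorder_all, the .tsv suffix, and the reordered output directory are hypothetical choices, not anything the tools require.

```shell
#!/bin/bash
# Print fields 4, 3, 5, 2, 1 of TAB-separated standard input.
reorder_columns() {
  awk -F '\t' -v 'OFS=\t' '{print $4, $3, $5, $2, $1}'
}

# Apply reorder_columns to every .tsv file in a directory, writing results
# to a "reordered" subdirectory so the originals stay untouched.
reorder_all() {
  local dir=$1 file
  mkdir -p "$dir/reordered" || return
  for file in "$dir"/*.tsv
  do
    [ -f "$file" ] || continue     # skip the literal pattern if nothing matched
    reorder_columns < "$file" > "$dir/reordered/$(basename "$file")"
  done
}

# Example (hypothetical directory): reorder_all ~/exports
```

Writing to a separate directory, rather than editing in place, makes it easy to compare a few results by eye before trusting the loop with everything.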

Of course, you can use your favorite scripting language (if it isn’t awk)… and that includes a shell. Here are two bash scripts, which you could also enter from the command line if you’d rather. They both set IFS, the shell’s internal field separator, to a TAB character while using read to read a line of data from the standard input. The -a option writes the fields into an array (f in Listing One, field in Listing Two), indexed from zero. Then echo outputs the fields in the order we want.

Listing One: Reordering fields with bash, version 1

#!/bin/bash
# Set $IFS to TAB only during "read" ($'\t' is bash quoting for a TAB):
while IFS=$'\t' read -r -a f
do echo -e "${f[3]}\t${f[2]}\t${f[4]}\t${f[1]}\t${f[0]}"
done

The script in Listing One has a couple of problems. One is that it uses echo -e to interpret the \t (TAB) escape sequences. If any of the data fields contain a backslash (\), too, echo may interpret those — and corrupt the data. The other problem is that, if your data has a lot of fields, passing a long list of arguments to echo is ugly and tedious.

The version in Listing Two uses nested loops. The inner for loop steps through all field numbers except the last, using echo -En to print the field contents and a TAB — without any backslash interpretation, and without a final newline. After the for loop finishes, a final echo -E outputs the last field and a newline:

Listing Two: Reordering fields with bash, version 2

#!/bin/bash
# Set $IFS to TAB only during "read" ($'\t' is bash quoting for a TAB):
while IFS=$'\t' read -r -a field
do
  for n in 3 2 4 1
  do
    # TAB appended after the field contents:
    echo -En "${field[n]}"$'\t'
  done
  echo -E "${field[0]}"
done
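
Another way around echo's backslash handling is printf, which interprets escapes only in its format string and prints %s arguments literally. The sketch below is a variation on Listing Two, not a replacement for it. The function name reorder_fields is made up, and the field order is passed as arguments instead of being hard-coded. The TAB is written as $'\t' so that no invisible character hides in the script.

```shell
#!/bin/bash
# Usage: reorder_fields N [N ...] < input
# Each N is a zero-based array index, as in Listings One and Two.
reorder_fields() {
  local -a field
  local n sep
  while IFS=$'\t' read -r -a field
  do
    sep=''
    for n in "$@"
    do
      # %s prints the data literally, so backslashes in a field survive.
      printf '%s%s' "$sep" "${field[n]}"
      sep=$'\t'
    done
    printf '\n'
  done
}

# Example: reorder_fields 3 2 4 1 0 < data.txt
```

One caveat applies to this sketch and to both listings: the shell treats TAB in IFS as whitespace, so read collapses consecutive TABs and empty fields disappear. For data with empty columns, the awk version is the safer choice.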

All three scripts above show a specific example of a general technique for rearranging data on a line: split the line into pieces (at TABs or some other place), then output those pieces in the order you want.

Modifying Some Lines

The techniques in this section are obvious, but worth mentioning. How can you output only certain input lines?

  1. If you can write a regular expression to match the lines you want, try grep, or grep -E (formerly egrep) for extended regular expressions.
  2. Using sed -n tells sed to output only the lines you choose (instead of the default, which is to output all lines). For instance:
    • sed -n '13,$p' outputs lines 13 through the end of input.
    • sed -n '/start/,/end/p' outputs each range of lines from one containing start through the next one containing end.
    • sed -n '10p;22p;93,95p' outputs lines 10, 22, 93, 94, and 95.
    • Of course, you can do more. Making a sed script file is sometimes simpler than typing long expressions at a shell prompt.
  3. Use a scripting language with a pattern match. For example, here’s an awk expression that outputs a record only if the second field contains this:

    $2 ~ /this/ {print $4, $3, $5, $2, $1}
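
The three approaches in the list above can be compared side by side. This is a rough sketch using throwaway input generated by seq and printf; none of it comes from the article's sample files.

```shell
#!/bin/bash
# 1. grep: keep lines matching a regular expression.
seq 1 30 | grep -E '^(7|2[0-2])$'        # lines 7, 20, 21, 22

# 2. sed -n with p: print only the addressed lines.
seq 1 30 | sed -n '28,$p'                # line 28 through the end
seq 1 30 | sed -n '3p;10,12p'            # lines 3, 10, 11, 12
printf '%s\n' intro start body end outro |
  sed -n '/start/,/end/p'                # start, body, end

# 3. awk: test one field, then print fields in a new order.
printf '%s\n' 'a this c' 'a that c' |
  awk '$2 ~ /this/ {print $3, $2, $1}'   # c this a
```

In the sed range example, leaving off the trailing p would make sed stop with an error instead of printing, because an address needs a command to go with it.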

Reordering Lines

Scripting languages can also reorder lines (or, in database terminology, records). (Because the lines need to be buffered or read from a temporary file, not all utilities are designed to reorder lines.) The GNU version of sort has a lot of ways to sort data — including random order (!) with sort -R.

Modern versions of sort use the -k option to specify sort fields. But how can you sort on the last word (field) in a line, if lines don’t all have the same number of words? Here’s a trick: copy the last word to the start of the line, sort on the first field in that temporary line, then strip off the first field. Watch:

$ cat salaries
Yvette van der Hoff 100000
Barack Obama 400000
Bernie Madoff 0
John Q. Public 20000
$ awk '{print $NF, $0}' salaries | sort -nr -k1,1 | cut -d" " -f2-
Barack Obama 400000
Yvette van der Hoff 100000
John Q. Public 20000
Bernie Madoff 0
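
The same decorate, sort, undecorate pattern can be wrapped in a small function for reuse. This is a sketch under two assumptions: the last field is numeric, and the function name sort_by_last_field is invented here.

```shell
#!/bin/bash
# Sort lines of standard input by their last whitespace-separated field,
# largest number first.
sort_by_last_field() {
  awk '{print $NF, $0}' |     # decorate: copy the last field to the front
    sort -nr -k1,1 |          # sort numerically on that copy, descending
    cut -d' ' -f2-            # undecorate: drop the copied field
}

# Example: sort_by_last_field < salaries
```

Because awk prints $0 unchanged after the copied field, the original spacing of each line is preserved. Only the first space-delimited field is stripped at the end.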

Re-flowing Text with fmt

The fmt utility reads words, separated by spaces, and outputs them reformatted into lines of approximately equal width. There’s an introduction in the section “Data Chunking with fmt” of the article More Data Surgery.

A not-so-obvious use of fmt is converting a paragraph from a group of newline-terminated lines (with “hard” line breaks) into a single long, wrapped line (“soft” line breaks). This can be very handy with copy-and-paste from one program to another. For instance, an email message often appears as a bunch of newline-terminated lines. If you copy and paste those “jagged” lines into another window, such as a word processor, the lines won’t flow smoothly into a paragraph. Figure Two shows how to fix this with fmt.

Figure Two: Flowing lines into paragraphs with fmt

The command cat > email reads its standard input and writes it to the file email. Using fmt -w 2000 outputs that file with lines up to 2,000 characters long (2000 is an arbitrary large number). Copy the fmt output with your mouse and paste it into the other GUI. (Note that fmt normally outputs two spaces at the end of a sentence. If you want a single space instead, pipe fmt’s output to tr -s ' '.)
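
The steps described above can be sketched as a small function. The name unwrap_paragraphs is hypothetical, and 2000 is the same arbitrary large width used in the text.

```shell
#!/bin/bash
# Join each paragraph's hard-wrapped lines into one long line.
# fmt keeps the blank lines that separate paragraphs.
unwrap_paragraphs() {
  fmt -w 2000 | tr -s ' '     # tr -s squeezes repeated spaces into one
}

# Example: unwrap_paragraphs < email
```

A paragraph longer than the chosen width will still be broken into more than one line, so pick a width larger than your longest paragraph. Also note that tr -s squeezes every run of spaces, including any indentation made of spaces.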

Two more notes before we wrap this up: GNU fmt has several options to control formatting. There are alternatives to fmt, too, including scripts in languages like Perl that handle formatting the way you want it.

Jerry Peek is a freelance writer and instructor who has used Unix and Linux for more than 25 years. He's happy to hear from readers; see http://www.jpeek.com/contact.html.

Comments on "Slicing and Dicing on the Command Line"

clowenstein

Usual problems with quoting quote characters and not seeing invisible tabs.

In listing 1, the backslash has disappeared from each quoted tab \t.

In both listing 1 and listing 2, there is an invisible tab between the quotation marks in the statement:
while IFS = " "

In listing 2, there is also an invisible tab in the statement:
echo -En "${field[n]} "

grdetil

The more obvious use of fmt is to wrap long lines, but it never occurred to me to use it to unwrap text as you illustrated! Thanks.

A note about sort -R to sort in random order: that’s a relatively new option that doesn’t exist on slightly older systems like RHEL 5. (Ditto for the -V option to sort by version number – another handy addition.) In an older script, I sorted in random order by using awk to prepend a random number to each line, similar to your trick to sort on the last word in a line:

awk 'BEGIN {srand()}; {print rand(), $0}' file | sort -k 1n | cut -d" " -f2-

isabellf

Please stop posting short articles on multiple pages !

We can’t check the previous page quickly (for instance to get back to look at the sample data). And we get annoyed with the additional publicity.

Thank you

forgewire

That’s right. Especially when there is an annoying ad between them.

mickza

Open the pages on separate tabs.

jp

@clowenstein, first, thanks for pointing out the missing backslashes in Listing One. I just fixed them.

Second, a way to make TABs more obvious (than the comment I put above each place they were used) is by defining a shell variable named, say, tab. Store a literal tab character into it, then use $tab each place you want a TAB.

Of course, the shell won’t expand $tab when it’s inside a pair of single quotes… but you can work around that by temporarily switching to double quotes. For instance, one way to echo the string $1TAB$2, where $1 and $2 are output literally (not expanded as parameters) is to put single quotes around $1 and $2, but double quotes around $tab. That is (and let’s hope the WordPress comment filters don’t mess this up):

$ echo '$1'"$tab"'$2'
$1 $2

jp

Thanks, @mickza, for mentioning browser tabs. In case anyone’s not familiar with this technique, it’s handy even for articles that appear on a long single page. For instance, if there’s a code listing that you want to refer to while reading other parts of the article, open the same article twice, in two browser tabs. Under the first tab, scroll the first copy of the article so the listing appears… and leave it positioned there. Read from the second tab — and click over to the first tab when you want to refer to the listing.
