Slicing and Dicing on the Command Line

If you don't know text, you don't know Linux. There are a host of methods for reformatting plain text -- including the text used by graphical applications like spreadsheets and email programs.

Plain text is a series of characters delimited into lines by newline (LF, line feed) characters. You can send this text directly to a terminal window with a utility like cat(1). There are no hidden formatting codes; it’s “just the text, ma’am.”

Before the puns get any worse, let’s dig in!

Quick Review

As you saw in last month’s column (if you missed it, you might want to read it first), you can start a new line at any point in plain text by inserting a newline character. To join two lines, remove the newline between them, and maybe add a space or TAB character to separate them.
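Here is a minimal sketch of both operations, assuming the standard tr(1) and paste(1) utilities. The sample words and the variable names split and joined are made up for illustration:

```shell
# Split: turn each space into a newline, giving one word per line.
split=$(printf 'alpha beta\n' | tr ' ' '\n')
printf '%s\n' "$split"

# Join: paste -s merges all input lines into a single line,
# using the character given with -d (a space here) as the separator.
joined=$(printf 'alpha\nbeta\n' | paste -s -d ' ' -)
printf '%s\n' "$joined"
```

Swapping the space for a TAB in the paste delimiter would give TAB-separated output instead.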

When a terminal or printer reads a TAB character, it moves the current position to the next tabstop. TAB characters are also used as field separators; you can make a simple database with TABs between the fields and a newline at the end of each record.
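As a rough illustration, the sketch below builds a tiny two-record "database" with printf(1) and then uses expand(1) to mimic what a terminal does at each tabstop. The field names and values are invented, and the tabstops are assumed to be 8 columns apart:

```shell
# Two records: TABs (\t) between the fields, a newline (\n) after each record.
db=$(printf 'name\tqty\napple\t3\n')

# expand replaces each TAB with enough spaces to reach the next tabstop
# (every 8 columns here), so the fields line up as they would on screen.
aligned=$(printf '%s\n' "$db" | expand -t 8)
printf '%s\n' "$aligned"
```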

Linux utilities can also reformat text that doesn’t contain TABs. We’ll see examples of that, too.

Lots of Possibilities

Many GNU utilities started in the days of Unix — back when a tty really was a teletype. Without a graphical display (or a graphical editor) to rearrange text, programmers came up with many ways to slice, dice, and reassemble data from scripts and the command line.

We’ll see some of those ways here: enough, I hope, that people new to handling text this way will be ready to find more on their own, and that gurus will still get a few surprises.

Starting with a Spreadsheet

Plain text can come from lots of places, including:

  • The output of a utility (grep, for instance),
  • Text saved from an application (see Figure One for an example),
  • Text pasted into a terminal window from a graphical application, as in Figure Two near the end of this article.

Note that some of this text may not be “plain” characters. For instance, if you’re copying from a web page designed by a Macintosh user, the designer may have unwittingly included the Macintosh encoding of a special character (maybe a “curly quote”) that isn’t recognized on your Linux system.

For the first few examples, let’s use an OpenOffice.org spreadsheet file saved as plain text. (On the File menu, choose Save As, type Text CSV.) Assuming that the data doesn’t contain any TAB characters, you can set the Field Delimiter to TAB and the Text Delimiter to none (delete the default quote mark in that dialog box). Figure One shows this.

Figure One: Saving a spreadsheet as plain text

Below are two views of the resulting file data.txt (renamed from the default data.csv). First, plain cat outputs the TAB characters between fields, which the terminal displays by moving to the next tabstop position. Next, cat -tve shows what’s actually in the file:

$ cat data.txt
STATE	CITY	COUNTY	POP.	GOVT.
AZ	Ely	Gila	123	Mayor
CA	Alma	Lolo	345	Sheriff
TX	Leroy	El Paso	22	Bubba
$ cat -tve data.txt
STATE^ICITY^ICOUNTY^IPOP.^IGOVT.$
AZ^IEly^IGila^I123^IMayor$
CA^IAlma^ILolo^I345^ISheriff$
TX^ILeroy^IEl Paso^I22^IBubba$

Checking the data file with cat -tve or od -c is a good idea. They’ll reveal “hidden” or “non-plain” characters buried in the data. Notice the space character in the field El Paso. Because the field separator is a TAB, the space doesn’t cause any problems.
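To see what a “hidden” character looks like, here is a small sketch that feeds cat -tve one record ending in a stray carriage return, as you might get from a file saved on another system. The record itself is invented for the example:

```shell
# One record with a TAB between fields and a carriage return (\r)
# sneaking in before the newline.
shown=$(printf 'AZ\tEly\r\n' | cat -tve)

# With -tve, the TAB appears as ^I, the carriage return as ^M,
# and the end of the line is marked with $.
printf '%s\n' "$shown"
```

In a plain cat listing, that carriage return would usually be invisible.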

Utilities that Understand TABs

Scripting languages (Perl, awk, …) can parse and write TAB-separated data. Table One lists some other Linux utilities that handle TABs.

Table One: Some utilities that understand TABs

Utility                  Description
cut(1)                   Remove sections from each line of files
echo(1), printf(1)       Write arguments to standard output (\t makes a TAB)
expand(1), unexpand(1)   Convert TABs to spaces, spaces to TABs
paste(1)                 Merge lines of files into TAB-separated output
sed(1)                   Stream editor
sort(1)                  Sort data by one or more of its fields

Whether your data comes from a spreadsheet or some other source, if you can massage your data into TAB-separated fields, the examples below can help you slice and dice it. Examples toward the end of the article cover other types of data.
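As a first taste, here is a minimal sketch that runs cut(1) and sort(1) on the sample data shown earlier. It recreates the data in a scratch file made with mktemp(1); the variable names tmp, tab, city_pop, and by_pop are only for illustration:

```shell
# Recreate the sample spreadsheet data in a scratch file.
tmp=$(mktemp)
printf 'STATE\tCITY\tCOUNTY\tPOP.\tGOVT.\n' >"$tmp"
printf 'AZ\tEly\tGila\t123\tMayor\n' >>"$tmp"
printf 'CA\tAlma\tLolo\t345\tSheriff\n' >>"$tmp"
printf 'TX\tLeroy\tEl Paso\t22\tBubba\n' >>"$tmp"

# cut uses TAB as its default delimiter; keep fields 2 (CITY) and 4 (POP.).
city_pop=$(cut -f2,4 "$tmp")
printf '%s\n' "$city_pop"

# Skip the header line, sort numerically on field 4 only, then keep the city.
# The literal TAB stored in $tab tells sort where each field ends.
tab=$(printf '\t')
by_pop=$(tail -n +2 "$tmp" | sort -t "$tab" -k4,4n | cut -f2)
printf '%s\n' "$by_pop"

rm -f "$tmp"
```

Because the separator is a TAB, the space in El Paso stays inside its field and does not throw off the field numbers.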

Comments on "Slicing and Dicing on the Command Line"

clowenstein

Usual problems with quoting quote characters and not seeing invisible tabs.

In listing 1, the backslash has disappeared from each quoted tab \t.

In both listing 1 and listing 2, there is an invisible tab between the quotation marks in the statement:
while IFS = ” “

In listing 2, there is also an invisible tab in the statement:
echo -En “${field[n]} “

grdetil

The more obvious use of fmt is to wrap long lines, but it never occurred to me to use it to unwrap text as you illustrated! Thanks.

A note about sort -R to sort in random order: that’s a relatively new option that doesn’t exist on slightly older systems like RHEL 5. (Ditto for the -V option to sort by version number – another handy addition.) In an older script, I sorted in random order by using awk to prepend a random number to each line, similar to your trick to sort on the last word in a line:

awk 'BEGIN {srand()}; {print rand(), $0}' file | sort -k 1n | cut -d" " -f2-

isabellf

Please stop posting short articles on multiple pages !

We can’t check the previous page quickly (for instance to get back to look at the sample data). And we get annoyed with the additional publicity.

Thank you

forgewire

That’s right. Especially when there is an annoying ad between them.

mickza

Open the pages on separate tabs.

jp

@clowenstein, first, thanks for pointing out the missing backslashes in Listing One. I just fixed them.

Second, a way to make TABs more obvious (than the comment I put above each place they were used) is by defining a shell variable named, say, tab. Store a literal tab character into it, then use $tab each place you want a TAB.

Of course, the shell won’t expand $tab when it’s inside a pair of single quotes… but you can work around that by temporarily switching to double quotes. For instance, one way to echo the string $1TAB$2, where $1 and $2 are output literally (not expanded as parameters) is to put single quotes around $1 and $2, but double quotes around $tab. That is (and let’s hope the WordPress comment filters don’t mess this up):

$ echo '$1'"$tab"'$2'
$1 $2
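One possible way to set this up, as a sketch: it assumes printf(1) is used to create the TAB, and the names line and visible are hypothetical, used here only to hold the results.

```shell
# Store one literal TAB in a variable; printf turns \t into the TAB character.
tab=$(printf '\t')

# Single quotes keep $1 and $2 literal; double quotes let $tab expand.
line='$1'"$tab"'$2'
printf '%s\n' "$line"

# Run the result through cat -t to confirm the TAB is really there (shown as ^I).
visible=$(printf '%s\n' "$line" | cat -t)
printf '%s\n' "$visible"
```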

jp

Thanks, @mickza, for mentioning browser tabs. In case anyone’s not familiar with this technique, it’s handy even for articles that appear on a long single page. For instance, if there’s a code listing that you want to refer to while reading other parts of the article, open the same article twice, in two browser tabs. Under the first tab, scroll the first copy of the article so the listing appears… and leave it positioned there. Read from the second tab — and click over to the first tab when you want to refer to the listing.

