Linux Runs on Text: Understanding & Handling Text

Text plays a central role in the Linux operating system. Take better control of your system with a firm understanding of what text is and how best to handle, format and convert it.

Types of text files

The previous section showed a text file from a Windows system. Each complete line in a Windows file ends with CRLF (carriage return and line feed characters).

On Unix and Linux systems, each line of a text file ends with a newline (LF, line feed) character. To see that, we’ll repeat the previous example on a Linux box. This time, let’s use bash’s built-in echo command with its -e and -n options. (If you aren’t using the bash shell, try typing /bin/echo instead of echo.) The -e option tells echo to convert escape sequences into characters; for instance, when echo reads \t, it echoes a TAB character. The -n option suppresses the final newline character:

/tmp$ echo -en "\tline 1 \nline 2" > lin
/tmp$ cat lin
        line 1
line 2/tmp$ cat -tve lin
^Iline 1 $
line 2/tmp$ od -c -w6 lin
0000000  \t   l   i   n   e
0000006   1      \n   l   i   n
0000014   e       2
0000017
/tmp$ wc lin
 1  4 15 lin
/tmp$

The difference between the previous Windows example and this one is that there’s no CR character here.

What about other OSes? Files from the classic Mac OS (before Mac OS X) use CR as a line terminator. Newer Macs, with Unix underneath, use the Unix LF terminator. (If you’d like to read more, see the Wikipedia article “Newline”.)
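As a quick way to tell these file types apart, here is a minimal sketch of a bash function that guesses which terminator a file uses. The name eol_type and the three labels it prints are made up for this example, and it assumes a grep that accepts a literal CR in its pattern (GNU grep does):

```shell
#!/bin/bash
# eol_type (hypothetical helper): guess a file's line terminator by
# comparing the number of lines that end in CR with the number of LFs.
eol_type() {
  local file=$1 crlf lf
  # $'\r$' is bash quoting for a literal CR followed by the regex anchor $.
  crlf=$(grep -c $'\r$' "$file")
  lf=$(wc -l < "$file")
  if (( crlf == 0 ))
  then
    echo "LF"        # no line ends in CR: Unix-style
  elif (( crlf >= lf ))
  then
    echo "CRLF"      # every line ends in CR LF: Windows-style
  else
    echo "mixed"     # some lines end in CR, some do not
  fi
}
```

Run as eol_type lin on the file created above, it should print LF. Treat the answer as a hint only; an empty file, for instance, is also reported as LF.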

Converting text files

Utilities like dos2unix, unix2dos, fromdos and todos add or remove a CR before each LF, as needed. Here’s another way to remove all CR (octal 15) characters from a file:

/tmp$ tr -d '\015' < dos_file > linux_file

If a line happens to have a carriage return in the middle of it, that tr method will strip it too. (For instance, to boldface a line on typewriter-like printers, you can make the printer’s carriage move to the start of the line and reprint the line several times, as in line^Mline^Mline.) To remove a CR only at the end of a line, try sed; in GNU sed, use \r in a regular expression to match CR and $ to match end-of-line:

/d/tmp$ sed 's/\r$//' win.txt | od -c -w6
0000000  \t   l   i   n   e
0000006   1      \n   l   i   n
0000014   e       2
0000017
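If dos2unix and friends are not installed, the sed idea above works in both directions. The following is a sketch only: the function names crlf_to_lf and lf_to_crlf are invented here, and it uses bash’s $'...' quoting to hand sed a literal CR, so it does not rely on sed understanding \r:

```shell
#!/bin/bash
# Hypothetical helpers; the function names are made up for this sketch.
# Both read the named files, or standard input if no files are named.

# Remove a CR only when it is the last character on a line.
crlf_to_lf() {
  sed $'s/\r$//' "$@"
}

# End every line with CR: first drop any CR already at the end of the
# line (so none is doubled), then append one.
lf_to_crlf() {
  sed $'s/\r$//; s/$/\r/' "$@"
}
```

For example, crlf_to_lf win.txt > linux_file leaves a CR in the middle of a line alone, unlike the tr method.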

Tabs

A TAB is a single character. On typewriters (remember them?) and printers with a moving head, a TAB moves the current position to the next tabstop location. Tabstops are typically 8 characters apart. For example, let’s say that the first position on a line is position 0, the second is position 1, and so on. So, if the cursor, printer head, etc., is in positions 0 through 7, a TAB will move to position 8. If the current position is in 8 through 15, a TAB moves to 16. We can see this using echo to output two lines. The X in the second line appears underneath the 8 in the first line, which is the next tabstop:

/tmp$ echo -e "0123456789\n..\tX"
0123456789
..      X
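The arithmetic behind that output can be sketched as a small shell function. The name next_tabstop and its optional second argument (the tabstop spacing, 8 by default) are assumptions made for this example:

```shell
#!/bin/bash
# next_tabstop (hypothetical helper): given a 0-based position, print the
# position that a TAB moves to, assuming evenly spaced tabstops.
next_tabstop() {
  local pos=$1 width=${2:-8}
  # Integer division finds the current tab zone; the next tabstop is the
  # start of the following zone.
  echo $(( (pos / width + 1) * width ))
}

next_tabstop 2     # the ".." example above: prints 8
next_tabstop 8     # already on a tabstop: prints 16
```

The expand(1) utility applies the same rule when it converts TABs to spaces.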

These days, TABs are more useful as field separators. For instance, if you save a spreadsheet to a plain-text file, one row per line, with a TAB between each cell from a row, you can use powerful Linux text utilities to extract or rearrange the data. We’ll see some examples next month.
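As a small preview of that idea, here is a sketch that pulls two columns out of TAB-separated data with cut(1), which treats TAB as its default field delimiter. The function name and the sample spreadsheet rows are invented for illustration:

```shell
#!/bin/bash
# first_and_third (hypothetical helper): print fields 1 and 3 of each
# TAB-separated line read from the named files or standard input.
first_and_third() {
  cut -f1,3 "$@"
}

# Made-up sample rows: item, quantity, price.
printf 'widget\t3\t1.50\ngadget\t10\t0.25\n' | first_and_third
```

Each output line keeps a TAB between the two selected fields, so the result is still easy to feed to other tools.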

Doublespacing

Once you know how text files are structured, you can rearrange the newlines. Want to doublespace some text — that is, add a blank line after every line? At the end of each line, add another newline. Let’s do it with sed. First, make a two-line test file:

/tmp$ echo -e "Line A\nLine B" > sed.in
/tmp$ cat sed.in
Line A
Line B
/tmp$ sed 's/$/\
> /' sed.in > sed.out
/tmp$ cat -e sed.out
Line A$
$
Line B$
$

(When you type a multiline command at a shell prompt, bash uses its secondary prompt > until the command is complete.) cat -e shows that sed.out has an empty line after each line.

What argument is sed actually getting? To find out, use echo -En to write the same argument to od -c. Using -E makes sure echo doesn’t interpret the backslash:

/tmp$ echo -En 's/$/\
> /' | od -c
0000000   s   /   $   /   \  \n   /
0000007

od shows that, after the backslash at the end of the first line (which is required so sed knows that its s command hasn’t finished), sed gets a newline (LF) character on the replacement side of the s command. So, at each existing newline, sed is adding another newline.
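A shorter route to the same result is sed’s G command, which appends a newline and the contents of the hold space (empty here) to each line. This sketch wraps it, along with one way to undo the doublespacing; the function names are made up for the example:

```shell
#!/bin/bash
# Hypothetical wrappers; the names are invented for this sketch.

# Doublespace: G appends a newline plus the (empty) hold space.
doublespace() {
  sed G "$@"
}

# Undo it: n prints the current line and reads the next one, which d
# then deletes. Every second line is dropped, blank or not.
undoublespace() {
  sed 'n;d' "$@"
}
```

With the test file above, doublespace sed.in | cat -e should show the same four lines as sed.out.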

Sidebar One: “Extra credit” sed question

Q: If sed can match a CR in s/\r/, why can’t it match a newline with s/\n/?

A: Because sed normally reads input line-by-line, and a newline is the end of the line, there’s no newline for sed to match.

(Actually, sed can include and match newlines with its multiline commands like G and N. There’s an example at the end of (Very) Small Editors.)
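As a small illustration of that multiline behavior, this sketch uses N to pull the next line into the pattern space, where the embedded newline can then be matched with \n. The function name join_pairs is hypothetical:

```shell
#!/bin/bash
# join_pairs (hypothetical helper): join each pair of input lines into one.
join_pairs() {
  # $!N      on every line but the last, append the next input line to
  #          the pattern space, with a newline between the two
  # s/\n/ /  replace that embedded newline with a space
  sed '$!N;s/\n/ /' "$@"
}

printf 'Line A\nLine B\nLine C\n' | join_pairs
```

Here the first two lines come out as "Line A Line B", and the unpaired "Line C" is printed unchanged.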

“Chunking” lines

How can you break an input with many lines into “chunks” of, say, 56 lines each? The pr(1) utility does this; by default, it outputs a file or standard input in 56-line chunks, with a 5-line header before each chunk and a 5-line footer after each chunk. (This fits printers that print 66 lines per 11-inch page.) If you have any long set of data, a technique like this can cut it into more manageable pieces.

A shell loop with redirected input can do the same thing as pr, and the technique is worth knowing when you need chunks of text. Listing One shows an example. The files named on the chunk56 command line (or the standard input, if no files are named) are written to the loop’s standard input. Each call to head -56 reads the next 56 lines of the loop’s standard input, which the shell stores in $chunk. Three echo commands output a 5-line header, the body, and a 5-line footer. Once $chunk is empty, the script breaks out of the endless loop (driven by the shell’s colon command, which always succeeds).

Listing One: chunk56: outputs 66-line pages

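#!/bin/bash
# chunk56: print files (or standard input) as 66-line pages:
# a 5-line header, up to 56 lines of text, and a 5-line footer.
# Caveat: on a pipe, head may read past line 56 and discard the extra
# text, and $(...) strips trailing newlines from each chunk.
page=1
cat "$@" |
while :
do
  chunk=$(head -56)    # next 56 lines of the loop's standard input
  if [[ -n $chunk ]]
  then
    echo -e "\n\nPAGE $page\n\n"    # header
    echo -E "$chunk"                # body
    echo -e "\n\n\n\n"              # footer
    page=$((page + 1))
  else
    break
  fi
done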
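One caveat with this approach: when its input is a pipe, head can read more than 56 lines at once and cannot put the extra text back, so later chunks may lose lines (GNU head, for example, reads its input in blocks). A more conservative variant, sketched here under the assumption that line-at-a-time reading is fast enough for your data, uses the shell’s own read instead. The function name chunk_pages and its optional chunk-size argument are not from the original listing:

```shell
#!/bin/bash
# chunk_pages (hypothetical variant of chunk56): read standard input one
# line at a time and print it in pages of SIZE lines (default 56), each
# with a 5-line header and a 5-line footer.
chunk_pages() {
  local size=${1:-56} page=1 count=0 line
  # The "|| [[ -n $line ]]" part keeps a final line that has no newline.
  while IFS= read -r line || [[ -n $line ]]
  do
    if (( count == 0 ))
    then
      printf '\n\nPAGE %d\n\n\n' "$page"    # 5-line header
    fi
    printf '%s\n' "$line"
    count=$((count + 1))
    if (( count == size ))
    then
      printf '\n\n\n\n\n'                   # 5-line footer
      count=0
      page=$((page + 1))
    fi
  done
  # Close a final, partly filled page.
  if (( count > 0 ))
  then
    printf '\n\n\n\n\n'
  fi
}
```

It can be used the same way as the loop in Listing One, for instance cat "$@" | chunk_pages 56. Because read never consumes more than one line at a time, blank lines inside a chunk are also preserved.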

Two tools for doing similar “chunking” — but writing the input to individual files instead of to standard output — are split(1) and csplit(1). As a quick example, here’s how you could split the output of someprog into a subdirectory named chunks, in 500-line files named chk.aa, chk.ab, and so on, then process each chunk:

$ mkdir chunks
$ someprog | split -l 500 - chunks/chk.
$ cd chunks
$ for file in chk.*
> do
>   ...
> done
$ rm chk.*
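To see what split produces, here is a self-contained sketch of the same steps. Everything in it is a stand-in: seq supplies 1200 numbered lines in place of a real program, the work happens in a temporary directory from mktemp, and the loop body only reports each chunk’s size:

```shell
#!/bin/bash
# Sketch with made-up input: 1200 numbered lines from seq stand in for
# the output of a real program.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/chunks"
# The "-" tells split to read standard input; the last argument is the
# prefix for the output file names.
seq 1 1200 | split -l 500 - "$demo_dir/chunks/chk."

for file in "$demo_dir"/chunks/chk.*
do
  # Stand-in for real processing: print each chunk's name and line count.
  echo "${file##*/}: $(wc -l < "$file" | tr -d ' ') lines"
done

# The suffixes aa, ab, ac sort in order, so cat rebuilds the input.
cat "$demo_dir"/chunks/chk.* | wc -l | tr -d ' '
```

This should report 500 lines each for chk.aa and chk.ab, 200 lines for chk.ac, and a total of 1200.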

(Much) more next month…

Now that we’ve got the basics down, you’re ready to slice and dice text. We’ll do a lot of that in the next segment of this series.

Jerry Peek is a freelance writer and instructor who has used Unix and Linux for more than 25 years. He's happy to hear from readers; see http://www.jpeek.com/contact.html.

Comments on "Linux Runs on Text: Understanding & Handling Text"

rhkramer

Did you really have to break the following line between the last two “/”s?

=
sed ’s/$/\
> /’ sed.in > sed.out
=

I guess I can figure out whether the s “command” should be ‘s/$//’ or ‘s/$/ /’, but you could have made it easier.

Hmm, guess I can’t (or not as easily as I’d hoped):

* first, I guess the “’” is a fancy curly quote or something, as my system (Debian 5.0) seems to balk at it:
sed: -e expression #1, char 1: unknown command: `�’

* second, even replacing the “’” with “‘” (and putting the command all on one line, with the substitution looking like this: ‘s/$//’ doesn’t add a newline after each existing line:

rhk@s17:/rhk$ sed ’s/$//’ sed.in > sed.out
sed: -e expression #1, char 1: unknown command: `�’
rhk@s17:/rhk$ sed ‘s/$//’ sed.in > sed.out
rhk@s17:/rhk$ cat -e sed.out
Line A$
Line B$
rhk@s17:/rhk$

Randy Kramer

PS: Just because I’m on a complaining “roll”, I’ll make one more–why require javascript to post this comment? After reading the article and writing the comment in konqueror (in which I keep javascript turned off (partly because it doesn’t seem to work reliably for all sites anyway, and partly because I don’t want the insecurity of javascript), I had to switch to a different browser to post this comment. (I think I’ll go back to bed and try to get out on the other side.)

rhkramer

Oops, I’m not sure I needed javascript to post the comment–I have cookies rejected from your site (like most sites)–maybe that was the problem. Anyway, I don’t intend to test it at this time.
