Text plays a central role in the Linux operating system. Take better control of your system with a firm understanding of what text is and how best to handle, format and convert it.
Types of text files
The previous section showed a text file from a Windows system. Each complete line in a Windows file ends with CRLF (carriage return and line feed characters).
On Unix and Linux systems, each line of a text file ends with a newline (LF, line feed) character. To see that, we’ll repeat the previous example on a Linux box. Let’s use the built-in bash utility echo -en this time. (If you aren’t using the bash shell, try typing /bin/echo instead of echo.) The -e option tells echo to convert escape sequences into characters; for instance, when echo reads \t, it echoes a TAB character. The -n option suppresses the final newline character:
/tmp$ echo -en "\tline 1 \nline 2" > lin
/tmp$ cat lin
        line 1 
line 2/tmp$ cat -tve lin
^Iline 1 $
line 2/tmp$ od -c -w6 lin
0000000  \t   l   i   n   e    
0000006   1      \n   l   i   n
0000014   e       2
/tmp$ wc lin
1 4 15 lin
The difference between the previous Windows example and this one is that there’s no CR character here.
What about other OSes? The classic Mac OS (before Mac OS X) used CR as its line terminator. Newer Macs, with Unix underneath, use the Unix LF terminator. (If you’d like to read more, see the Wikipedia article “Newline”.)
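If you're handed a file and aren't sure which convention it follows, a few lines of shell can make an educated guess. Here's a minimal sketch. The function name line_endings is made up for this example, and it reports CRLF as soon as any line ends that way, so a file with mixed endings needs a closer look with od -c.

```shell
#!/bin/bash
# line_endings: hypothetical helper, not a standard utility.
# Prints CRLF, LF, CR or none for the file named in $1.
line_endings() {
    local file=$1 cr=$'\r' n_lf n_cr
    # Count LF and CR bytes; $(( )) strips any padding wc may add.
    n_lf=$(( $(tr -cd '\n' < "$file" | wc -c) ))
    n_cr=$(( $(tr -cd '\r' < "$file" | wc -c) ))
    if (( n_lf == 0 && n_cr == 0 )); then
        echo none
    elif (( n_lf == 0 )); then
        echo CR                      # CRs but no LFs: old Mac style
    elif grep -q "${cr}\$" "$file"; then
        echo CRLF                    # at least one line ends in CR LF
    else
        echo LF
    fi
}
```

Try it on the lin file from the example above: line_endings lin should report LF.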
Converting text files
Utilities like dos2unix, unix2dos, fromdos and todos add or remove a CR before each LF, as needed. Here’s another way to remove all CR (octal 15) characters from a file:
/tmp$ tr -d '\015' < dos_file > linux_file
If a line happens to have a carriage return in the middle of it, that tr method will strip those too. (For instance, to boldface a line on typewriter-like printers, make the printer’s carriage move to the start of the line and reprint the line several times, as in line^Mline^Mline.) To remove CR only at the end of a line, try sed; use \r in a regular expression to match CR and $ to match end-of-line:
/d/tmp$ sed 's/\r$//' win.txt | od -c -w6
0000000  \t   l   i   n   e    
0000006   1      \n   l   i   n
0000014   e       2
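The same sed idea works in the other direction, too. The sketch below is not a replacement for dos2unix and unix2dos. The function names to_lf and to_crlf are invented here, and the CR is handed to sed through a shell variable so the commands don't depend on your sed understanding \r.

```shell
#!/bin/bash
cr=$'\r'    # a literal carriage return, built by bash

# to_lf: hypothetical helper; strip one CR at the end of each line.
to_lf() {
    sed "s/${cr}\$//"
}

# to_crlf: hypothetical helper; end every line with exactly one CR.
# Matching any CRs already at end-of-line keeps a second run from
# adding a second CR.
to_crlf() {
    sed "s/${cr}*\$/${cr}/"
}

# A CR in the middle of a line survives; only the one before LF goes.
printf 'line^M\rline\r\n' | to_lf | od -c
```

Both functions are filters: they read standard input and write standard output, so they drop into a pipeline the same way tr does.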
A TAB is a single character. On typewriters (remember them?) and printers with a moving head, a TAB moves the current position to the next tabstop location. Tabstops are typically 8 characters apart. For example, let’s say that the first position on a line is position 0, the second is position 1, and so on. So, if the cursor, printer head, etc., is in positions 0 through 7, a TAB will move to position 8. If the current position is in 8 through 15, a TAB moves to 16. We can see this using echo to output two lines. The X in the second line appears underneath the 8 in the first line, which is the next tabstop:
/tmp$ echo -e "0123456789\n..\tX"
0123456789
..      X
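The tabstop arithmetic is simple enough to write down. This sketch assumes tabstops every 8 columns unless you say otherwise. The function name next_tabstop is hypothetical, and expand(1) is used only to turn the TAB into spaces so you can count them.

```shell
#!/bin/bash
# next_tabstop: hypothetical helper.
# $1 = current position (0-based), $2 = tabstop spacing (default 8).
next_tabstop() {
    local pos=$1 width=${2:-8}
    # Integer division finds the current 8-column block;
    # the next tabstop is the start of the following block.
    echo $(( (pos / width + 1) * width ))
}

next_tabstop 2                    # the position after ".." in the example

# expand replaces each TAB with enough spaces to reach the next tabstop.
printf '..\tX\n' | expand | cat -e
```

With two characters already on the line, the TAB has to cover six columns to reach position 8, which is where the X lands.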
These days, TABs are more useful as field separators. For instance, if you save a spreadsheet to a plain-text file, one row per line, with a TAB between each cell from a row, you can use powerful Linux text utilities to extract or rearrange the data. We’ll see some examples next month.
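As a small taste, here's a sketch with cut(1), which treats TAB as its field separator by default. The sample rows and the function name field are invented for illustration.

```shell
#!/bin/bash
# field: hypothetical helper; print the TAB-separated field(s)
# named in $1 (for example 2, or 1,3) from standard input.
field() {
    cut -f "$1"
}

# Three made-up spreadsheet rows: item, quantity, price.
printf 'apple\t3\t0.50\npear\t12\t0.25\nplum\t7\t0.40\n' | field 2
```

Because each TAB is exactly one character, cut never has to guess where a cell ends, even when a cell contains spaces.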
Once you know how text files are structured, you can rearrange the newlines. Want to double-space some text (that is, add a blank line after every line)? At the end of each line, add another newline. Let’s do it with sed. First, make a two-line test file, then run sed on it:
/tmp$ echo -e "Line A\nLine B" > sed.in
/tmp$ cat sed.in
Line A
Line B
/tmp$ sed 's/$/\
> /' sed.in > sed.out
/tmp$ cat -e sed.out
Line A$
$
Line B$
$
(When you type a multiline command at a shell prompt, bash uses its secondary prompt > until the command is complete.) cat -e shows that sed.out has an empty line after each line.
What argument is sed actually getting? To find out, use echo -En to write the same argument to od -c. Using -E makes sure echo doesn’t interpret the backslash:
/tmp$ echo -En 's/$/\
> /' | od -c
0000000 s / $ / \ \n /
od shows that, after the backslash at the end of the first line (which is required so sed knows that its s command hasn’t finished), sed gets a newline (LF) character on the replacement side of the s command. So, at each existing newline, sed is adding another newline.
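There's a shorter route to the same result. It's a different technique from the s command above: sed's G command appends a newline plus the contents of the (empty) hold space to each line. The function names double_space and single_space are made up for this sketch, and single_space assumes its input really is double-spaced.

```shell
#!/bin/bash
# double_space: hypothetical helper; add an empty line after every line.
double_space() {
    sed G "$@"
}

# single_space: hypothetical helper; undo double_space by printing a
# line (n), then deleting the line that follows it (d).
single_space() {
    sed 'n;d' "$@"
}

printf 'Line A\nLine B\n' | double_space | cat -e
```

Both functions accept filenames or read standard input, just as sed does.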
How can you break an input with many lines into “chunks” of, say, 56 lines each? The pr(1) utility does this; it outputs a file or standard input in 56-line chunks, with a 5-line header before each chunk and a 5-line footer after each chunk. (This fits printers that print 66 lines per 11-inch page.) If you have any long set of data, a technique like this can cut it into more-manageable pieces.
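The page arithmetic is worth a quick check: 5 + 56 + 5 = 66. Here's a sketch that works out how many pages an input needs. The function names pages_needed and output_lines are hypothetical, and output_lines assumes every page, including the last, is padded to full length.

```shell
#!/bin/bash
# pages_needed: hypothetical helper.
# $1 = input lines, $2 = body lines per page (default 56).
pages_needed() {
    local lines=$1 body=${2:-56}
    # Round up: a partly filled last chunk still needs its own page.
    echo $(( (lines + body - 1) / body ))
}

# output_lines: hypothetical helper; total lines written if each page
# carries a 5-line header and a 5-line footer around the body.
output_lines() {
    local lines=$1 body=${2:-56}
    echo $(( $(pages_needed "$lines" "$body") * (body + 5 + 5) ))
}

pages_needed 100
output_lines 100
```

To compare against the real thing, pipe the same number of lines through pr and count what comes out, as in seq 1 100 | pr | wc -l.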
A shell loop with redirected input can do the same thing as pr, and the technique is worth knowing when you need chunks of text. Listing One shows an example. The files named on the chunk56 command line — or the standard input, if no files are named — are written to the loop’s standard input. Each call to head -56 reads the next 56 lines of the loop’s standard input, which the shell stores in $chunk. Three echo commands output a 5-line header, the body, and a 5-line footer. Once $chunk is empty, the endless loop (driven by the shell’s colon operator) is broken.
Listing One: chunk56: outputs 66-line pages
#!/bin/bash
cat "$@" |
while :; do
  chunk=$(head -56)
  if [[ -n $chunk ]]; then
    echo -e "\n\nPAGE $((++page))\n\n"
    echo -E "$chunk"
    echo -e "\n\n\n\n"
  else break; fi
done
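One caveat with a head-based loop: when its input is a pipe, head may read a whole buffer and throw away whatever follows the lines it prints (GNU head behaves this way, since it can't seek backward on a pipe), so later chunks can go missing. Here's a sketch of a variant that reads line by line with the shell's read built-in instead. The function name chunk_pages and its optional chunk-size argument are inventions for this example, and it will be slower than head on big inputs.

```shell
#!/bin/bash
# chunk_pages: hypothetical helper.
# Reads standard input; $1 = body lines per chunk (default 56).
chunk_pages() {
    local size=${1:-56} page=0 count=0 line
    # "|| [[ -n $line ]]" keeps a final line that lacks a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if (( count == 0 )); then
            page=$((page + 1))
            printf '\n\nPAGE %d\n\n\n' "$page"    # 5-line header
        fi
        printf '%s\n' "$line"                     # one body line
        count=$((count + 1))
        if (( count == size )); then
            printf '\n\n\n\n\n'                   # 5-line footer
            count=0
        fi
    done
    # Close a partly filled last chunk with its footer.
    if (( count > 0 )); then
        printf '\n\n\n\n\n'
    fi
}
```

For example, seq 1 120 | chunk_pages writes three pages: two full 56-line chunks and one with the remaining 8 lines.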
Two tools for doing similar “chunking” — but writing the input to individual files instead of to standard output — are split(1) and csplit(1). As a quick example, here’s how you could split the output of someprog into a subdirectory named chunks, in 500-line files named chk.aa, chk.ab, and so on, then process each chunk:
$ mkdir chunks
$ someprog | split -l 500 - chunks/chk.
$ cd chunks
$ for file in chk.*
> do
>   process "$file"    # "process" stands for whatever command handles one chunk
> done
$ rm chk.*
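If you'd like to watch split work before trusting it with real data, here's a self-contained sketch. It substitutes seq for someprog, writes into a throwaway directory from mktemp, and uses the hypothetical function name split_demo. The lone - operand tells split to read standard input, which leaves the next operand free to be the output prefix.

```shell
#!/bin/bash
# split_demo: hypothetical helper; splits 1200 generated lines into
# 500-line files, reports on them, then cleans up.
split_demo() {
    local dir
    dir=$(mktemp -d) || return 1
    seq 1 1200 | split -l 500 - "$dir/chk."
    ( cd "$dir" && ls )                # names of the chunk files
    grep -c '' "$dir/chk.ac"           # lines in the last, short chunk
    cat "$dir"/chk.* | grep -c ''      # lines when the chunks are rejoined
    rm -r "$dir"
}

split_demo
```

The shell expands chk.* in sorted order, so concatenating the chunks puts the original input back together.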
(Much) more next month…
Now that we’ve got the basics down, you’re ready to slice and dice text. We’ll do a lot of that in the next segment of this series.