Linux Runs on Text: Understanding & Handling Text

Text plays a central role in the Linux operating system. Take better control of your system with a firm understanding of what text is and how best to handle, format and convert it.

Types of text files

The previous section showed a text file from a Windows system. Each complete line in a Windows file ends with CRLF (carriage return and line feed characters).

On Unix and Linux systems, each line of a text file ends with a newline (LF, line feed) character. To see that, we’ll repeat the previous example on a Linux box. This time, let’s use bash’s built-in echo command with the options -en. (If you aren’t using the bash shell, try typing /bin/echo instead of echo.) The -e option tells echo to convert escape sequences into characters; for instance, when echo reads \t, it echoes a TAB character. The -n option suppresses the final newline character:

/tmp$ echo -en "\tline 1 \nline 2" > lin
/tmp$ cat lin
        line 1
line 2/tmp$ cat -tve lin
^Iline 1 $
line 2/tmp$ od -c -w6 lin
0000000  \t   l   i   n   e
0000006   1      \n   l   i   n
0000014   e       2
0000017
/tmp$ wc lin
 1  4 15 lin
/tmp$

The difference between the previous Windows example and this one is that there’s no CR character here.
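
To compare the two conventions side by side, here is a small sketch that builds the same two lines both ways. It uses printf rather than echo because printf handles escapes the same way across shells. The file names unix.txt and dos.txt, and the scratch directory, are made up for this illustration:

```shell
#!/bin/bash
# Work in a scratch directory so nothing in /tmp is overwritten.
dir=$(mktemp -d)

# Same two lines, once with LF endings and once with CRLF endings.
printf 'line 1\nline 2\n'     > "$dir/unix.txt"
printf 'line 1\r\nline 2\r\n' > "$dir/dos.txt"

# The DOS-style file is one byte longer per line: the extra CR.
# $(( ... )) strips any padding that wc puts around the number.
unix_bytes=$(( $(wc -c < "$dir/unix.txt") ))
dos_bytes=$(( $(wc -c < "$dir/dos.txt") ))
echo "unix: $unix_bytes bytes, dos: $dos_bytes bytes"

# Count the CR characters by deleting every other character.
cr_count=$(( $(tr -cd '\r' < "$dir/dos.txt" | wc -c) ))
echo "CRs in dos.txt: $cr_count"

rm -r "$dir"
```

The two byte counts should differ by exactly the number of lines, since each CRLF line carries one extra character.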

What about other OSes? Old Macintosh files use CR as a line terminator. New Macs, with Unix underneath, use the Unix LF terminator. (If you’d like to read more, here’s a Wikipedia article about “Newline”.)

Converting text files

Utilities like dos2unix, unix2dos, fromdos and todos add or remove a CR before each LF, as needed. Here’s another way to remove all CR (octal 15) characters from a file:

/tmp$ tr -d '\015' < dos_file > linux_file

If a line happens to have a carriage return in the middle of it, that tr method will strip it too. (For instance, to boldface a line on a typewriter-like printer, you can make the printer’s carriage move to the start of the line and reprint the line several times, as in line^Mline^Mline.) To remove a CR only at the end of a line, try sed; use \r in a regular expression to match CR and $ to match end-of-line:

/d/tmp$ sed 's/\r$//' win.txt | od -c -w6
0000000  \t   l   i   n   e
0000006   1      \n   l   i   n
0000014   e       2
0000017
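
The difference between the two methods is easiest to see on input that has a CR in the middle of a line. The sketch below uses a made-up sample string. It also passes sed a literal CR through bash’s $'...' quoting, on the assumption that not every sed understands the \r escape:

```shell
#!/bin/bash
# Hypothetical sample: an overstruck line (CR in the middle), CRLF endings.
sample=$'bold\rbold\r\nplain\r\n'

# tr deletes every CR, including the one in the middle of the first line.
with_tr=$(printf '%s' "$sample" | tr -d '\r')

# sed deletes a CR only when it is the last character on a line.
# $'...' makes bash insert a literal CR into the sed script.
with_sed=$(printf '%s' "$sample" | sed $'s/\r$//')

# od -c makes any surviving CR visible as \r.
printf '%s\n' "$with_tr"  | od -c
printf '%s\n' "$with_sed" | od -c
```

In the od output, the tr result has no \r at all, while the sed result still shows the \r between the two copies of “bold”.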

Tabs

A TAB is a single character. On typewriters (remember them?) and printers with a moving head, a TAB moves the current position to the next tabstop location. Tabstops are typically 8 characters apart. For example, let’s say that the first position on a line is position 0, the second is position 1, and so on. So, if the cursor, printer head, etc., is in positions 0 through 7, a TAB will move to position 8. If the current position is in 8 through 15, a TAB moves to 16. We can see this using echo to output two lines. The X in the second line appears underneath the 8 in the first line, which is the next tabstop:

/tmp$ echo -e "0123456789\n..\tX"
0123456789
..      X
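
To make the tabstop arithmetic visible, the expand utility can replace each TAB with the spaces it stands for. This is only a sketch; the variable names are made up for the illustration:

```shell
#!/bin/bash
# expand replaces each TAB with enough spaces to reach the next tabstop.
# Default tabstops are every 8 columns, so X lands in position 8.
default_stops=$(printf '..\tX\n' | expand)

# With tabstops every 4 columns, the same TAB only moves to position 4.
narrow_stops=$(printf '..\tX\n' | expand -t 4)

# Print a ruler line, then both expanded lines under it.
printf '0123456789\n%s\n%s\n' "$default_stops" "$narrow_stops"
```

The first expanded line puts X under the 8 in the ruler; the second puts it under the 4.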

These days, TABs are more useful as field separators. For instance, if you save a spreadsheet to a plain-text file, one row per line, with a TAB between each cell from a row, you can use powerful Linux text utilities to extract or rearrange the data. We’ll see some examples next month.
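
As a small preview of that idea, here is a sketch with cut, which treats TAB as its default field separator. The two rows of data are invented for the example:

```shell
#!/bin/bash
# Hypothetical spreadsheet export: one row per line, cells separated by TABs.
rows=$'apples\t3\t0.50\npears\t12\t0.25\n'

# cut splits on TAB by default; -f picks fields by number.
names=$(printf '%s' "$rows" | cut -f1)

# Several fields can be selected at once; cut keeps them in input order.
name_price=$(printf '%s' "$rows" | cut -f1,3)

printf '%s\n' "$names" "$name_price"
```

Because the separator is a single TAB character, cells that contain spaces would still come through as one field.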

Doublespacing

Once you know how text files are structured, you can rearrange the newlines. Want to doublespace some text — that is, add a blank line after every line? At the end of each line, add another newline. Let’s do it with sed. First, make a two-line test file:

/tmp$ echo -e "Line A\nLine B" > sed.in
/tmp$ cat sed.in
Line A
Line B
/tmp$ sed 's/$/\
> /' sed.in > sed.out
/tmp$ cat -e sed.out
Line A$
$
Line B$
$

(When you type a multiline command at a shell prompt, bash uses its secondary prompt > until the command is complete.) cat -e shows that sed.out has an empty line after each line.

What argument is sed actually getting? To find out, use echo -En to write the same argument to od -c. Using -E makes sure echo doesn’t interpret the backslash:

/tmp$ echo -En 's/$/\
> /' | od -c
0000000   s   /   $   /   \  \n   /
0000007

od shows that, after the backslash at the end of the first line (which is required so sed knows that its s command hasn’t finished), sed gets a newline (LF) character on the replacement side of the s command. So, at each existing newline, sed is adding another newline.
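
A shorter way to get the same result is sed’s G command. This sketch assumes nothing has been stored in sed’s hold space, which is the case for a one-command script like this one:

```shell
#!/bin/bash
# G appends a newline plus the (empty) hold space to every line,
# so each input line is followed by one empty line.
printf 'Line A\nLine B\n' | sed G | cat -e

# Two input lines become four output lines.
doubled_lines=$(( $(printf 'Line A\nLine B\n' | sed G | wc -l) ))
echo "lines after doublespacing: $doubled_lines"
```

The cat -e output should match what the s command produced for sed.out.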

Sidebar One: “Extra credit” sed question

Q: If sed can match a CR in s/\r/, why can’t it match a newline with s/\n/?

A: Because sed normally reads input line-by-line, and a newline is the end of the line, there’s no newline for sed to match.

(Actually, sed can include and match newlines with its multiline commands like G and N. There’s an example at the end of (Very) Small Editors.)
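
Here is a minimal sketch of that exception. The four input lines and the “ + ” separator are arbitrary choices for the demonstration:

```shell
#!/bin/bash
# N appends the next input line to the pattern space, with a newline
# between the two. Now there is a newline for s/\n/ to match.
joined=$(printf 'Line A\nLine B\nLine C\nLine D\n' | sed 'N;s/\n/ + /')
printf '%s\n' "$joined"
```

Each pair of input lines comes out as one line, because the s command replaced the newline that N put between them.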

“Chunking” lines

How can you break an input with many lines into “chunks” of, say, 56 lines each? The pr(1) utility does this; it outputs a file or standard input in 56-line chunks, with a 5-line header before each chunk and a 5-line footer after each chunk. (This fits printers that print 66 lines per 11-inch page.) If you have any long set of data, a technique like this can cut it into more manageable pieces.

A shell loop with redirected input can do the same thing as pr, and the technique is worth knowing when you need chunks of text. Listing One shows an example. The files named on the chunk56 command line (or the standard input, if no files are named) are written to the loop’s standard input. Each call to head -56 reads the next 56 lines of the loop’s standard input, which the shell stores in $chunk. Three echo commands output a 5-line header, the body, and a 5-line footer. Once $chunk is empty, the break command ends the otherwise endless loop (driven by the shell’s : command, which does nothing and always succeeds).

Listing One: chunk56: outputs 66-line pages

#!/bin/bash
page=1
cat "$@" |
while :
do
  chunk=$(head -56)
  if [[ -n $chunk ]]
  then
    echo -e "\n\nPAGE $page\n\n"
    echo -E "$chunk"
    echo -e "\n\n\n\n"
    let page=page+1
  else
    break
  fi
done
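
Two caveats apply to this loop. When the input is a pipe, head may read ahead past the lines it prints and be unable to give the extra back, so lines can go missing between chunks. Also, a chunk made up entirely of blank lines leaves $chunk empty and ends the loop early. The sketch below is one possible alternative, not a replacement for Listing One: it counts lines with the shell’s own read command. The function name chunk_pages and its size argument are made up for this example:

```shell
#!/bin/bash
# chunk_pages N: copy standard input to standard output in N-line pages,
# each with a 5-line header and a 5-line footer (hypothetical helper).
chunk_pages() {
  local size=$1 page=1 count=0 line
  # IFS= and -r keep leading blanks and backslashes intact;
  # the || test also catches a final line that has no newline.
  while IFS= read -r line || [[ -n $line ]]
  do
    # Start of a new page: emit the header first.
    if (( count == 0 ))
    then
      printf '\n\nPAGE %d\n\n\n' "$page"
    fi
    printf '%s\n' "$line"
    (( count += 1 ))
    # Page is full: emit the footer and reset for the next page.
    if (( count == size ))
    then
      printf '\n\n\n\n\n'
      (( page += 1 ))
      count=0
    fi
  done
  # Close a partly filled last page with its footer.
  if (( count > 0 ))
  then
    printf '\n\n\n\n\n'
  fi
}

# 120 input lines make three pages: 56 + 56 + 8 lines of body.
seq 1 120 | chunk_pages 56 | grep '^PAGE'
```

Reading one line at a time is slower than head on large inputs, but the shell never consumes more input than it outputs, and blank lines count like any others.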

Two tools for doing similar “chunking” — but writing the input to individual files instead of to standard output — are split(1) and csplit(1). As a quick example, here’s how you could split the output of someprog into a subdirectory named chunks, in 500-line files named chk.aa, chk.ab, and so on, then process each chunk:

$ mkdir chunks
$ someprog | split -l 500 - chunks/chk.
$ cd chunks
$ for file in chk.*
> do
>   ...
> done
$ rm chk.*
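
To try this without a real someprog, the following sketch uses seq as a stand-in and a scratch directory in place of chunks. Both are assumptions for the demonstration only:

```shell
#!/bin/bash
# Scratch directory standing in for the chunks subdirectory.
chunks=$(mktemp -d)

# seq stands in for someprog: 1200 lines of input.
# The lone "-" tells split to read standard input; the prefix follows it.
seq 1 1200 | split -l 500 - "$chunks/chk."

# Report how many lines landed in each chunk.
for file in "$chunks"/chk.*
do
  printf '%s %d\n' "${file##*/}" "$(( $(wc -l < "$file") ))"
done

# Record a few facts before cleaning up.
chunk_count=$(( $(ls "$chunks" | wc -l) ))
last_lines=$(( $(wc -l < "$chunks/chk.ac") ))
rm -r "$chunks"
```

With 1200 lines of input, the loop reports chk.aa and chk.ab at 500 lines each and chk.ac with the remaining 200.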

(Much) more next month…

Now that we’ve got the basics down, you’re ready to slice and dice text. We’ll do a lot of that in the next segment of this series.
