Linux Runs on Text: Understanding & Handling Text

Text plays a central role in the Linux operating system. Take better control of your system with a firm understanding of what text is and how best to handle, format and convert it.

This month, as another part of the series about using text on Linux systems, we’ll introduce “plain text” and how you can restructure it. We’ll see how to identify text from different systems (Unix, DOS, Mac) and to convert text between systems. The article ends with some examples, and there’ll be lots more next month.

If you’re used to clicking on files to view and edit them, you’ll probably find some new tools and concepts here. Gurus, please have a look at the main example and be sure it’s familiar.

The fundamental concept is the role of the newline (line feed) character. Reformatting text is basically a matter of juggling newlines. Let’s dig in.
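As a small taste of what "juggling newlines" means, here is a minimal sketch using the standard tr and paste utilities. The sample words are made up for illustration and are not part of the examples that follow.

```shell
# Turn every space into a newline: one line of input becomes three lines.
printf 'one two three\n' | tr ' ' '\n'

# Go the other way: paste -s joins all input lines into one,
# and -d ' ' puts a single space where each newline used to be.
printf 'one\ntwo\nthree\n' | paste -s -d ' ' -
```

In both directions the words are untouched; only the newline characters move.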

What’s Text?

As I wrote in last month’s column, Linux runs on text. Text comes in a lot of flavors. What we’ll cover this month is plain text: a stream of characters that you can output directly to a terminal window using a utility like cat(1). The text doesn’t have special formatting codes like “start boldface” or “24 pixels high” that are only understood by certain operating systems or applications. Plain text doesn’t require a word processing program like OpenOffice.org Writer to interpret instructions buried before and between the actual text.

Let’s make a test text file that we can use to demonstrate a lot of things about text. Although the example is a bit tedious, the techniques will be useful later, when you need to know what’s in a text file. We’ll make this text file on a Windows system, and make a similar file later under Linux.

On Microsoft Windows systems, each line of a plain text file ends with a carriage return (CR) character followed by a newline (LF, line feed) character. Let’s use a “Command prompt” window to copy text typed from the keyboard (con:) into a Windows-format file named win.txt. On DOS-type systems, pressing CTRL-Z followed by the ENTER key ends input. In the session below, the copy command and the two lines of text (ending with ^Z) are user input; the rest is system output:

D:\tmp>copy con: win.txt
        line 1
line 2^Z
        1 file(s) copied.

D:\tmp>
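If you don’t have a Windows system handy, a file with the same bytes can be built from a Linux shell. This is a minimal sketch, assuming a POSIX printf; the single space after "line 1" is taken from the od output shown later in this article, not from the DOS session itself.

```shell
# Recreate win.txt byte for byte without a Windows system.
# \t = TAB, \r = carriage return (CR), \n = newline (LF).
# The second line deliberately has no CR LF after it, matching
# what typing CTRL-Z before ENTER produced in the DOS session.
printf '\tline 1 \r\nline 2' > win.txt

# Count the bytes written; the result should be 16.
wc -c < win.txt
```

Using printf rather than echo avoids differences between shells in how echo treats backslash escapes and the trailing newline.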

Next, let’s read that file from a Linux-type terminal window (for instance, under Cygwin or from a Windows filesystem cross-mounted onto a Linux host). We’ll look at the file with cat, cat -tve, and od -c, and count the number of lines, words, and characters using wc(1). (For an introduction to od, see the section “od and friends” of Wizard Boot Camp, Part 10.)

/d/tmp$ cat win.txt
        line 1
line 2/d/tmp$
/d/tmp$ cat -tve win.txt
^Iline 1 ^M$
line 2/d/tmp$
/d/tmp$ od -c -w6 win.txt
0000000  \t   l   i   n   e
0000006   1      \r  \n   l   i
0000014   n   e       2
0000020
/d/tmp$ wc win.txt
 1  4 16 win.txt
/d/tmp$

Here are details. (You might open another copy of this page in a separate window so you can see the example while reading the details.)

  1. The first command, cat win.txt, shows a file that looks like the text we entered in the DOS window. However, the bash shell prompt, /d/tmp$, comes just after the text line 2 from the file — instead of on a new line by itself.

    Why? It’s because (as we’ll see below) the contents of win.txt don’t end with a newline character. The shell always prints a prompt immediately after the output of a command (in this case, the cat utility) finishes. There’s no newline at the end of the file, so the shell prompt appears on the same line.

  2. The second command shows a lot more:

    1. The option -t tells cat to show TAB characters as ^I, so you can see that the indentation before line 1 is caused by a TAB.
    2. The option -v tells cat to show “nonprinting” characters visibly, which lets us see that there’s a carriage return character, shown as ^M, after a space character, following the text line 1.

      Each line of a DOS text file ends with two characters: a carriage return and a newline (line feed). After showing the carriage return visibly as ^M, cat marks the newline that follows it with a $ character, as requested by the -e option described next.

    3. The option -e tells cat to mark the end of a line with $. This lets you see just where a newline falls.
  3. The third command, od -c, shows the character representation of bytes one-by-one. The -w6 option lists six bytes per line. Each line starts with the octal offset from the start of the file. You can see:

    1. The first six bytes (at offsets 0000000 through 0000005) are a TAB character (which od shows as \t), the word line and a space character.
    2. The second six bytes (from offset 0000006) are the digit 1, a space character, a carriage return character (which od shows as \r), a newline character (which od shows as \n), and the first two characters of the next line of the file, li.

      od shows the structure of a text file. The newline character — the end of the first “line” in the file — is just a character. The bytes of the next “line” start immediately after the newline. (As we’ll see later, you can insert newline characters anywhere you want to start new lines.)

    3. The last four bytes (octal offsets 0000014 through 0000017) are the letters n, e, a space, and the digit 2. There’s no carriage return, no newline. That’s because, while making the file, we typed the DOS end-of-input character CTRL-Z before pressing RETURN (ENTER) to end the line.
  4. The wc utility reports 1 line, 4 words, and 16 characters.

    1. Because there’s only one newline character, there’s only one “line”. (The second line isn’t complete.)
    2. There are four words: line, 1, line, and 2.
    3. The 16 characters include the carriage return and the newline. (You can see them in the od output, and see that the final offset, 0000020 octal, which shows the number of bytes read, is 16 decimal.) Although the TAB makes a lot of whitespace (it moves the cursor to the next “tab stop” on the terminal, as we’ll see below), it’s only a single character.
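The checks in the list above can also be made without reading od output by eye. The following is a rough sketch, not the method used in this article: it rebuilds the sample file with printf so that it stands alone (an assumption; the original was made with copy con:), and the variable names are made up for illustration.

```shell
# Rebuild the sample file so this sketch stands alone.
printf '\tline 1 \r\nline 2' > win.txt

# Count carriage returns: tr -cd deletes every byte except \r.
cr_count=$(tr -cd '\r' < win.txt | wc -c | tr -d ' ')

# Count newlines the same way; this is the number that wc -l reports.
nl_count=$(tr -cd '\n' < win.txt | wc -c | tr -d ' ')

# Show the last byte of the file; a complete final line would end in \n.
last_byte=$(tail -c 1 win.txt | od -An -c | tr -d ' ')

# Convert the final od offset from octal to decimal.
# printf treats a numeric argument with a leading 0 as octal.
size=$(printf '%d' 0000020)

echo "CRs: $cr_count  newlines: $nl_count  last byte: $last_byte  size: $size"
```

A nonzero carriage return count suggests DOS-style line endings, and a last byte other than \n means the final line is incomplete, which is why the shell prompt landed right after line 2.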

Comments on "Linux Runs on Text: Understanding & Handling Text"

rhkramer

Did you really have to break the following line between the last two “/”s?

=
sed ’s/$/\
> /’ sed.in > sed.out
=

I guess I can figure out whether the s “command” should be ‘s/$//’ or ‘s/$/ /’, but you could have made it easier.

Hmm, guess I can’t (or not as easily as I’d hoped):

* first, I guess the “’” is a fancy curly quote or something, as my system (Debian 5.0) seems to balk at it:
sed: -e expression #1, char 1: unknown command: `�’

* second, even replacing the “’” with “‘” (and putting the command all on one line, with the substitution looking like this: ‘s/$//’ doesn’t add a newline after each existing line:

rhk@s17:/rhk$ sed ’s/$//’ sed.in > sed.out
sed: -e expression #1, char 1: unknown command: `�’
rhk@s17:/rhk$ sed ‘s/$//’ sed.in > sed.out
rhk@s17:/rhk$ cat -e sed.out
Line A$
Line B$
rhk@s17:/rhk$

Randy Kramer

PS: Just because I’m on a complaining “roll”, I’ll make one more–why require javascript to post this comment? After reading the article and writing the comment in konqueror (in which I keep javascript turned off (partly because it doesn’t seem to work reliably for all sites anyway, and partly because I don’t want the insecurity of javascript), I had to switch to a different browser to post this comment. (I think I’ll go back to bed and try to get out on the other side.)

rhkramer

Oops, I’m not sure I needed javascript to post the comment–I have cookies rejected from your site (like most sites)–maybe that was the problem. Anyway, I don’t intend to test it at this time.
