The Truth About Text

Prospective Linux users often ask, "How does Linux differ from Microsoft Windows?" Depending on the background and interests of the person inquiring, there are a variety of answers. However, the questioner is seldom prepared to understand what is probably the best one.

Figure One: In a filter chain, the output of one filter is the input of the next.

Linux: A World Built of Text

The most significant difference between Linux (and other Unix-like systems) and Microsoft Windows is that under Linux, everything is a file. Moreover, wherever practical, Linux uses text files rather than binary files. How can the humble text file distinguish Linux from Microsoft Windows? Many programs and facilities under both Linux and Windows require configuration information. Under Windows, this information is generally stored in a special binary file known as the registry; under Linux, this information is stored in text files.

Although it is possible to edit the Windows registry by using the regedit program that mimics the functions of a simple text editor, most users do not do so. Windows registry entries are generally terse and cryptic. A casual approach to editing the Windows registry can easily result in a dead system. Instead, Windows users rely on dialog boxes. To access or modify configuration information of a particular sort, you must use a dialog box expressly designed for that purpose.

There are many advantages to the Linux way of doing things. One of the most significant advantages is that when you edit or otherwise manipulate a text file under Linux, the result is a text file; files don’t “become binary” simply because you edit or manipulate them. This often makes it possible to automate and simplify many tasks that you would have to perform manually in a Windows environment. An important Linux facility for automating work is the class of commands known as filters.

Software Plumbing

Linux includes an assortment of commands called filters that manipulate text files. Like a crazed plumber, you can connect a series of filters together in order to manipulate the contents of a text file. A filter reads text passed to it from a file or from the output of a program, modifies the text in some way and outputs the modified text. Because filters read and output text, the output of a filter can be connected to the input of another filter to form a filter chain. If the output of a filter is not connected to another filter, the output appears on the console. Figure One shows a filter chain that processes the output of the rpm command. The output is successively manipulated by grep, sort, and less and then displayed on the console.
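
As a rough sketch of the kind of chain Figure One describes, the following example replaces rpm with a hypothetical shell function, list_packages, that prints a few made-up package names. That substitution is an assumption made only so the example runs on systems without rpm; on an RPM-based system you would begin the chain with rpm -qa and would typically end it with less.

```shell
# Hypothetical stand-in for "rpm -qa": prints made-up package names, one per line.
list_packages() {
    printf '%s\n' zlib-1.1.3 XFree86-4.0.1 bash-2.04 XFree86-libs-4.0.1
}

# Keep only the names containing "XFree86", then sort them.
# Interactively, you might append "| less" to page through the result.
filtered=$(list_packages | grep 'XFree86' | sort)
printf '%s\n' "$filtered"
```

Each command in the chain sees only the lines that the command to its left chose to pass along.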

It’s easy to construct a filter chain. Start with a command that produces console output. To send the output of the command to a filter, type a vertical bar (‘|’, also known as a “pipe”) followed by the name of the filter and any arguments that direct the operation of the filter.

As an example of filtering the output of a command, consider using ls -l (which produces a long listing on the console) together with the grep filter (which selectively drops lines from a text stream). Here’s a simple filter chain that displays the subdirectories of the current directory:

ls -l | grep '^d'

Without the help of the grep filter, the ls -l command would list every file and subdirectory stored in the current working directory, one per line. The grep filter eliminates all output lines except those that have a d in the first column. If you experiment with ls -l, you can see that directories are distinguished from files by the presence of a ‘d’ in the first column. The caret (^) tells grep to look for the d only in the first column; otherwise, any line containing a d elsewhere, such as in a file name, would be listed with the directories. Both the caret and the d are enclosed by a pair of single quotes that prevent the shell from examining the search pattern, which is intended for grep rather than the shell.
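
To see the effect without touching your own files, here is a minimal sketch that builds a scratch directory and runs the long listing (ls -l) through grep. The directory and file names are hypothetical, chosen so that one ordinary file has a d in its name.

```shell
# Build a scratch directory holding two subdirectories and one ordinary file.
scratch=$(mktemp -d)
mkdir "$scratch/docs" "$scratch/src"
touch "$scratch/data.txt"   # an ordinary file whose name happens to contain a d

# In the long listing, each directory line begins with the letter d.
subdirs=$(ls -l "$scratch" | grep '^d')
printf '%s\n' "$subdirs"

# Clean up the scratch directory.
rm -r "$scratch"
```

Only the lines for docs and src survive; the line for data.txt begins with a hyphen, so the anchored pattern rejects it.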

It’s child’s play to construct a more complex chain from an existing one. Simply add another vertical bar (or “pipe”) and filter name at the end of the existing filter chain. For example, you can add a sort filter to the example given earlier by writing something like this:

ls -l | grep '^d' | sort -r

The -r flag of the sort command causes the output of the grep command to be sorted in reverse order. Because sort compares entire lines, a line beginning with z would, in general, occur before a line beginning with a.

So what if you want to filter the contents of a file, rather than the output of a command or filter? No problem — just specify the name of the file as the final argument of the filter command. For example, to sort and view the hosts file, you could write:

sort /etc/hosts | less

which would display the contents of the /etc/hosts file in alphabetical order. In case you are concerned, the sort command does not alter the contents of the /etc/hosts file. Instead, the command reads the file and writes the sorted lines to its output. Of course, you can use sort to create a new file containing a sorted version of the contents of the original file if you wish. Simply direct the output of the command to a new file and then replace the original file with the sorted one. For example, you could sort the /etc/hosts file like this:

sort /etc/hosts > /tmp/hosts
mv /tmp/hosts /etc/hosts

As you’ve probably gathered, the greater than symbol (>) directs the output of a program or filter to the named file rather than the console.
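
The following sketch illustrates the same idea on throwaway files rather than /etc/hosts. The file contents are made up for the example; the point is that the original file is left unchanged while the sorted lines land in the second file.

```shell
# Create a small unsorted file to stand in for a file such as /etc/hosts.
original=$(mktemp)
sorted=$(mktemp)
printf '%s\n' cherry apple banana > "$original"

# sort reads the original and writes the sorted lines to the second file.
sort "$original" > "$sorted"

first_original=$(head -n 1 "$original")   # still the first line as written
first_sorted=$(head -n 1 "$sorted")       # first line in alphabetical order
printf 'original: %s, sorted: %s\n' "$first_original" "$first_sorted"

# Clean up the temporary files.
rm -f "$original" "$sorted"
```

Be careful never to redirect the output to the very file being read (for example, sort file > file): the shell empties the file before sort has a chance to read it.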

Both the grep and the sort commands can be somewhat confusing, and both have many options that allow you to manipulate text in various ways. The man pages are the best place to learn the ins and outs of all the flags and arguments these commands understand. Table One lists some of the flags recognized by the sort command; the grep command is examined in the next section.

Table One: Flags and Arguments Recognized by the sort Command

-b               Ignore leading blanks in sort fields.
-d               Consider only blanks and alphanumeric characters (uppercase and lowercase letters and digits) in sort fields.
-f               Sort lowercase letters as though they were uppercase.
-k pos1[,pos2]   Define a sort field starting at pos1 and ending at pos2.
-n               Sort according to numerical value.
-r               Reverse the order of the sort.
-t sep           Use sep rather than white space as the separator between sort fields.
-u               Output only one line from among a set of identical lines.
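
As an illustration of how several of these flags combine, here is a minimal sketch that sorts colon-separated records by a numeric field. The records are hypothetical and merely resemble the layout of a file such as /etc/passwd.

```shell
# Hypothetical colon-separated records: name, shell, numeric id.
records=$(mktemp)
printf '%s\n' 'carol:/bin/bash:1002' 'alice:/bin/sh:99' 'bob:/bin/bash:1001' > "$records"

# -t : sets the separator, -k 3,3 limits the sort field to the third field,
# and -n compares that field by numerical value rather than character by character.
by_id=$(sort -t : -k 3,3 -n "$records")
printf '%s\n' "$by_id"

# Adding -r reverses the order, so the largest id comes first.
largest=$(sort -t : -k 3,3 -n -r "$records" | head -n 1)

# Clean up the temporary file.
rm -f "$records"
```

Without -n, the id 1001 would sort ahead of 99, because the character 1 precedes the character 9.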

The grep Command

Perhaps the most popular and useful filter is grep, whose name comes from the ed editor command g/re/p, which means “globally search for a regular expression and print.” If you don’t know what that means, don’t worry about it for now. A regular expression is basically a special type of text pattern, and grep is a pattern matcher that searches for a regular expression in a given stream of text. When used outside a filter chain, grep reads a file and prints lines that match (or do not match) the specified pattern (regular expression). When used as part of a filter chain, grep examines each input line it receives and selectively passes the line to its output, based on whether the line matches the specified pattern. The grep command takes the form:

grep [flags] pattern [file]

where flags are flags that control grep’s operation, and pattern is the pattern against which lines are compared. Table Two summarizes the most important flags recognized by grep. The optional argument, file, is omitted when grep receives its input from another command in a filter chain.

Table Two: Flags Recognized by the grep Command

-c   Print only a count of matching lines.
-i   Ignore case distinctions in comparisons.
-l   Print only the name of each input file that contains a matched line. Not useful in a filter chain.
-n   Print the line number of each matched line.
-v   Reverse the sense of the matching so that only unmatched lines are printed.
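
The following sketch exercises four of these flags against a small sample file. The host names and addresses are invented for the example and are not taken from any real /etc/hosts file.

```shell
# Hypothetical sample data resembling a hosts file.
sample=$(mktemp)
printf '%s\n' '127.0.0.1 localhost' '192.168.1.10 Athlon' '192.168.1.11 pentium' > "$sample"

# -i ignores case, so the pattern athlon matches the line containing Athlon.
match=$(grep -i 'athlon' "$sample")

# -c prints only the number of matching lines.
count=$(grep -c '192' "$sample")

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'pentium' "$sample")

# -v prints only the lines that do not match.
others=$(grep -v 'localhost' "$sample")

printf '%s\n' "$match" "$count" "$numbered" "$others"

# Clean up the temporary file.
rm -f "$sample"
```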

An important use of grep is scanning a file for lines that match a pattern. For example, to scan the /etc/hosts file for lines that contain the word athlon, issue a command such as:

grep "athlon" /etc/hosts

The matching lines, if any, are printed on the console.

However, the real power of grep comes from the language used to specify the pattern. As we mentioned previously, the pattern is known as a regular expression, which is something like MS-DOS wildcards only much more sophisticated. Table Three summarizes the most important special characters — called metacharacters — which you can use to create patterns. It’s customary, by the way, to enclose patterns in single quotes so that the shell won’t attempt to process the pattern; otherwise, the shell may alter the pattern so that it no longer specifies the operation you desire.

Table Three: Important Regular Expression Characters

.              Matches any single character.
?              Specifies that the preceding expression is optional and need not be matched.
*              Specifies that the preceding expression is optional and need not be matched. Moreover, the expression can be matched indefinitely many times.
+              Specifies that the preceding expression is required, but can be matched indefinitely many times.
\x             Specifies that the character x is understood as an ordinary character, even if it is a metacharacter.
[list]         Specifies a list of characters, any one of which can be used in matching. For example, the expression [0123456789] matches any digit.
[range]        Specifies a range of characters, any one of which can be used in matching. For example, the expression [0-9] matches any digit.
[rangerange]   Specifies multiple ranges of characters, any one of which can be used in matching. For example, the expression [a-z0-9] matches any lowercase letter or digit.
^              Matches the beginning of the line.
$              Matches the end of the line.
(expression)   Groups one or more elements as a single expression. Commonly used with a metacharacter such as *, so that the metacharacter applies to the entire expression.

Note that ?, +, and the grouping parentheses are part of the extended regular expression syntax. To use them as shown, run grep with the -E flag.
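
Here is a minimal sketch that tries several of these metacharacters on a made-up word list. It runs grep with the -E flag, which enables the extended regular expression syntax that ?, +, and parentheses require; the words themselves are hypothetical test data.

```shell
# Hypothetical word list used only to exercise the patterns.
words=$(mktemp)
printf '%s\n' color colour colouur abab x9 a.b axb > "$words"

# ? makes the preceding u optional: matches color and colour only.
optional=$(grep -E '^colou?r$' "$words")

# + requires at least one u: matches colour and colouur, but not color.
repeated=$(grep -E '^colou+r$' "$words")

# Parentheses group ab so that + applies to the whole group.
grouped=$(grep -E '^(ab)+$' "$words")

# A backslash makes the dot ordinary: matches a.b but not axb.
literal=$(grep -E '^a\.b$' "$words")

# A bracketed range followed by $ matches lines that end in a digit.
digit=$(grep -E '[0-9]$' "$words")

printf '%s\n' "$optional" "$repeated" "$grouped" "$literal" "$digit"

# Clean up the temporary file.
rm -f "$words"
```

Anchoring each pattern with ^ and $ forces it to account for the whole line, which makes it easier to see exactly what each metacharacter contributes.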

Regular expressions are extremely powerful but can also be incredibly confusing and take a good deal of time to master. It’s a good idea to start practicing now, so let’s check out some more examples.

Here’s a grep command that finds installed RPM packages with names containing the letter X and lets you page through the result:

rpm -qa | grep 'X' | less

Here’s a grep command that finds lines in the /etc/hosts file that do not contain the phrase localdomain.com:

grep -v 'localdomain.com' /etc/hosts | less

Here’s a grep command that finds comment lines in the inetd.conf file:

grep '^#' /etc/inetd.conf | less

This command depends on the fact that comments generally begin with a hash mark (#) in the first column. Here’s a more sophisticated command that recognizes any line that begins with a hash mark as a comment, even if the hash mark is preceded by whitespace consisting of blanks and tabs. It uses the character class [[:blank:]], which matches a blank or a tab, because grep does not interpret \t inside brackets as a tab:

grep '^[[:blank:]]*#' /etc/inetd.conf | less
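
The difference between the strict and the relaxed comment patterns can be checked with a small sketch like the one below. It assumes the POSIX character class [[:blank:]], which matches a blank or a tab, and it uses invented configuration lines rather than a real inetd.conf file.

```shell
conf=$(mktemp)
# Hypothetical configuration lines; the third begins with a tab and the fourth with blanks.
printf '# first comment\nftp stream tcp\n\t# tab-indented comment\n   # blank-indented comment\n' > "$conf"

# Count comments that start in the first column only.
strict=$(grep -c '^#' "$conf")

# Count comments optionally preceded by blanks or tabs.
relaxed=$(grep -c '^[[:blank:]]*#' "$conf")

# -v reverses the match, leaving only the lines that are not comments.
active=$(grep -v '^[[:blank:]]*#' "$conf")

printf 'strict=%s relaxed=%s active=%s\n' "$strict" "$relaxed" "$active"

# Clean up the temporary file.
rm -f "$conf"
```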

And, here’s a grep command that finds lines in the /etc/hosts file that do not begin with a digit:

grep -v '^[0-9]' /etc/hosts | less

Have Fun With Filters

Now that you’ve learned how to use filters such as less, sort, and grep, be alert to ways that you can use them to streamline your work. When you perform a system administration task, try to figure out a way to build a script that will make it easier to perform the task the next time. Filters are likely to be helpful in building such scripts. Both sort and grep include many more options than could be described in this article. Consult the man pages for information on additional options that may be appropriate to your situation.
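
As one possible starting point for such a script, the sketch below wraps a short filter chain in a shell function. The function name, active_lines, and the sample data are hypothetical; the chain prints the lines of a configuration file that are neither comments nor blank, sorted and with duplicates removed.

```shell
# Hypothetical helper: print the active lines of a configuration file,
# meaning lines that are neither comments nor blank, sorted with duplicates removed.
active_lines() {
    grep -v '^[[:blank:]]*#' "$1" | grep -v '^[[:blank:]]*$' | sort -u
}

# Try the helper on made-up data.
demo=$(mktemp)
printf '# services\n\ntelnet stream tcp\nftp stream tcp\nftp stream tcp\n' > "$demo"
result=$(active_lines "$demo")
printf '%s\n' "$result"

# Clean up the temporary file.
rm -f "$demo"
```

Once a chain like this lives in a function or script file, the next run of the task is a single command.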

Next month, we’ll look at the vi editor, which provides a variety of useful editing features probably not found in the point-and-click editors with which you’re familiar. Until then, enjoy working with your new friends, the filters!

Other Filters

As you might expect, popular Linux distributions include many filters besides sort and grep. Here are some of the most popular:

awk      A programming language for processing text.
colrm    Removes columns from its input stream.
column   Formats the input stream into columns.
expand   Converts tabs to spaces.
fmt      Reformats paragraphs of text.
head     Prints only the first several lines of the input stream.
nl       Numbers lines.
perl     A programming language for processing text.
pr       Formats lines for printing.
sed      Performs text replacement.
tail     Prints only the last several lines of the input stream.
tr       Translates or deletes characters.
uniq     Suppresses adjacent identical lines.

Check the man pages for more information about these filters.
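
To give a flavor of how these filters cooperate, here is a minimal sketch that finds the most frequent word in a line of text. The sample sentence is invented, and the chain is only one of many ways to combine these commands.

```shell
text='the cat and the dog and the bird'

# tr puts each word on its own line, sort groups identical words together,
# uniq -c collapses each group into a count, sort -n -r ranks the counts,
# and head keeps only the top line.
top=$(printf '%s\n' "$text" | tr ' ' '\n' | sort | uniq -c | sort -n -r | head -n 1)
printf '%s\n' "$top"

# awk picks out the columns of that line: the count, then the word itself.
top_count=$(printf '%s\n' "$top" | awk '{ print $1 }')
top_word=$(printf '%s\n' "$top" | awk '{ print $2 }')
```

The sort before uniq matters: uniq compares only neighboring lines, so identical words must be brought together first.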

Bill McCarty is an associate professor at Azusa Pacific University. He can be reached at bmccarty@apu.edu.
