dcsimg

Linux/Unix Text Processing Tools

I have a large amount of data taken from a number of different databases/mailing lists. I want to pull all of this data into a common format so that I can read it into a spreadsheet. What kinds of tools does Linux have to help me sort through all this data? I know Perl is good at text processing, but I don't want to write a whole program.

I have a large amount of data taken from a number of different databases/mailing lists. I want to pull all of this data into a common format so that I can read it into a spreadsheet. What kinds of tools does Linux have to help me sort through all this data? I know Perl is good at text processing, but I don’t want to write a whole program.

When it comes to text processing, two of the most basic and powerful tools are built into the shell; they are available right from the command line — redirection and pipes. These let you string together a number of smaller commands, which allow you to create very flexible and powerful text processing systems. To shed some light on your question, let’s look at redirection and pipes and the Linux/Unix utilities sed, awk, grep, and cut.

To understand how redirection and pipes work, it is necessary to have a very basic understanding of the Unix and Linux concept of a “file.” Under Unix (and Linux), everything is considered to be a “file.” There are really only two basic activities associated with files: you can read from them and you can write to them. Executable programs are files, and so are text files. The interesting thing is that devices (such as your hard disk, keyboard, and monitor) are also considered files. Viewing all of these things as files means that even a small program can contain code that allows it to easily read or write information to or from any device or any other file.

Each file is given a unique “file descriptor,” which is represented by a number that serves to uniquely identify that file. The three most basic file descriptors are numbers 0, 1, and 2, which are associated with the files Standard Input, Standard Output, and Standard Error respectively. These are abbreviated as STDIN, STDOUT, and STDERR; they represent your input device (usually your keyboard), your output device (usually the serial console), and the standard place to which the system reports errors that may have occurred (also usually the console).

Regardless of the files you or your programs create, every program running on Linux can be assumed to have these three file descriptors already set up and will direct input, output, and errors to those “standard” locations — that is, unless explicitly told to do otherwise.

It’s that “explicitly told to do otherwise” line that forms the basis for the concept of redirection. Redirection is the process of changing the target file descriptor for your program’s input, output, or errors. For a simple example of this, let’s look at the ls command. By default, its STDOUT is defined to be your monitor, so when you type ls all the files in the current directory are displayed on your screen. However, if you want to store that listing for later use you could redirect it to a file by typing something like this:


% ls > filelist.txt

The “greater than” operator (>) is used to redirect ls‘s STDOUT to a file (instead of the screen).

Occasionally, things can get a little confusing. There may be times when you’ll do something like this and get information displayed to the screen anyway. This is because the output is being sent to STDERR instead of STDOUT, and STDOUT defaults to displaying to the screen as well. You can redirect this by typing:


% ls 2> errors.txt

with ’2′ being the file descriptor for STDERR.

You can also redirect STDIN. In many cases, STDIN will be the keyboard. Most programs accept keystroke entries — until you send them an EOF (end-of-file) signal (usually CTRL-D) — and will proceed to do whatever they were intended to do. Take the old Unix mail program for example. You could just type:


% mail Your Message

followed by CTRL-D, and your message would get sent. What if you make a mistake while typing your input? It would certainly be easier using an editing tool like vi or emacs than typing at the command line.

This is exactly what you can do. Type your message using vi (or whatever); when you’re done, save your work to a file. Let’s call ours message. txt. You can then redirect STDIN for the mail program to be your file message.txt. You would simply type:


% mail < message.txt

Meanwhile, redirection has a close relative in Unix — pipes. Pipes are represented by the | character and they basically take the STDOUT from one program and use it as STDIN for another. In Unix-speak, the output of one is “piped” into the other. That’s what happens when you type ls-l| more. The output from ls is made into the more program’s input.

At this point, it seems like we’ve strayed far from the problem at hand, but believe it or not, there is a method to our madness; these redirectors and pipes are what allow us to string several commands together to do fancy text processing.

Let’s say a chunk of your data has name, phone number with area code, job description, and salary information in a file called user_list.txt, which has the format found in Listing One.




Listing One: Sample Data


Joe User, 415-555-1212, Boss 10000
Sue User, 415-555-1213, VP 15500
Tom User, 415-555-1214, Loader 21000
Ed Loser, 415-555-1215, Horse 100
Lucy User, 415-555-1216, VP 11000

Linux has tons of text processing tools that you can use in conjunction with redirection and pipes to manipulate this data. For example, if you only wanted to get out the phone numbers, you could use the cut command.

You would type something like this:


% cat user_list.txt | cut -d “,” -f 2

This would “pipe” the output from cat (which would otherwise just do a screen dump of user_ list. txt) to the program cut. Cut tells it to use comma-delimiters (-d “,”) and the second “field” (-f 2); in this case, the text is broken up into three fields separated by the two comma-delimiters. This should produce the following output:


415-555-1212
415-555-1213
415-555-1214
415-555-1215
415-555-1216

Now let’s say you wanted to store that output in a separate file called numbers.txt. You could do this easily by issuing the command:


% cat user_list.txt | cut -d “,” -f 2 > numbers.txt

You could get fancy and pull out just the Vice Presidents’ phone numbers. The best way to do this would be with the grep command. Grep stands for “Get Regular Expression,” and it’s basically a pattern matcher (a regular expression is another name for a text pattern). You would do this by typing:


% grep VP user_list.txt | cut -d “,” -f 2 >vp_numbers.txt

vp_numbers.txt would contain only these numbers:


415-555-1213
415-555-1216

The default behavior of grep reports each line that matches its first argument. You can also modify the behavior of grep using regular expressions, but that is a whole different topic. grep is one of those commands where you’re better off referring to the man page.

If you really wanted to tweak things, you could further trim your output so that only the last seven digits of each phone number would appear. You could issue the following:


% grep VP user_list.txt | cut -d “,” -f 2 | cut -d”-” -f2,3

Now that we’ve messed around with cat, cut, and grep, let’s check out another old-school Unix text-tool — sed, the so-called “stream editor.” Sed is a very complicated and multi-functional tool. If you think of it as vi for the command line, and not interactive, you have got the right idea.

Sed does lots of cool things to text, but let’s look at an example that’s relevant to what we’re already working on. Let’s say all your Vice Presidents had their title changed to “Senior Assistant Bozo” (or something like that). You could change each of them by hand or with sed. The command would look like the following:


% cat user_list.txt | sed -e ‘s/VP/Senior Assistant Bozo/g’

This should display the contents of user_list.txt to STDOUT. To store the output as a file, you can redirect it to a new file name or back to the original, and it would store the replacements permanently. You would use:


% cat user_list.txt | sed -e ‘s/VP/Senior Assistant Bozo/g’ > newlist.txt

The sed command accepts a great number of arguments, one of which (-e) tells sed to “execute” a script. However, in this case, we specify the commands on the command line rather than in a script. So, the -e argument is followed by the sed “script” — ‘s/VP/ Senior Assistant Bozo/ g’.

Again, we are treading close to the topic of regular expressions. But in a nutshell, here’s what this command is doing. The s tells sed to substitute the first string found between the slashes (in this case ‘VP‘) with the string found between the second and third slashes (in this case, ‘Senior Assistant Bozo‘). The g tells sed to make this replacement globally, because sed‘s default behavior would just modify the first instance found.

This is just a tiny example of what sed can do. For more on this, you really have to read the man page. Still, the concept of how sed works is fairly straightforward; generally, it takes an input stream, modifies it by your rules, and creates output. Tie that together with grep, cut, cat, and redirection and you start to see the power of all these tools.

The last text processing tool we’ll touch on is awk, or in this case, gawk (the GNU version). awk is essentially a whole programming language, but it can also be used easily in a limited way (much like sed).

Let’s say you want to find all of the people in your user_list.txt whose salaries are greater than $15,000. Let’s assume that you know that salary information is stored in the fifth field of the text file and that names are stored in the first field. You could do something like this:


% cat user_list.txt | awk ‘$5 > 15000′

This will print the fifth field ($5) if it is greater than 15,000. Perhaps you only want the name of those that match to be printed. To do this, use the command:


% cat user_list.txt | awk ‘$5 > 15000 {print $1,$2}’

This will print:


Sue User,
Lucy User,

Just to be neat, let’s say you want to strip out the trailing commas. Run it through sed like this:


% cat user_list.txt | awk ‘$5 > 15000 {print $1,$2}’ | sed -e ‘s/,//g’

We can barely scratch the surface of each of these commands in a column of this length (particularly gawk). Still, I hope you see how using all these pieces together can transform any sort of crazy ASCII text into usable formatted data. You can use these tools to transform disparate data from various sources into a nice comma-separated text file, which can then be freely read into any spreadsheet.

Hope this helps.





Gaylen Brown is a senior consultant at Linuxcare, Inc. He can be reached at tech@linux-mag.com.

Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/linux-mag.com/site/www/htdocs/wp-content/themes/linuxmag/single.php on line 62