Hitting the Motherlode

Mining data for information?


Among computing flatfoots — DBAs, Web application developers, shell script programmers, data miners, system administrators, and that ilk — the terms delimiter, pattern, and munge (pronounced “munj”) are common and well-known. Indeed, these and other even more mysterious phrases — backslash dee, caret dollar, and slash gee – get thrown around so much they must sound like gibberish to the uninitiated.

No, these expressions aren’t the refined phrasings of the cognoscenti, nor are they passwords of some secret society. They’re slang — shop talk about one of computing’s handiest tools, regular expressions (or regex).

In a nutshell, regular expressions are a powerful, compact, and concise shorthand for describing patterns of text. The shorthand, perhaps cryptic at first, is a hyper-minimalist solution to a big problem: “does this data conform to this format?”

As it turns out, recognizing and parsing text are two of the most common problems solved with a computer. Think about it. Almost every computer application munges (“munge” means “take all of this data and extract something meaningful out of it”) data in some way. Word processors search documents to find and correct double words (e.g., “the the”); email applications deconstruct messages to extract mail headers (e.g., “From”, “To”, “Bcc”); Web applications parse form data to validate input fields (e.g., a zip code or a telephone number).

If you have to munge data, regular expressions are indispensable. Indeed, once you know how to write regular expressions, you can write them in a fraction of the time it would take you to create equivalent parsing engines. Armed with regular expressions, you’ll solve harder problems, write more efficient solutions, and wonder how you ever survived without them.

A Word About PERL

All the examples we’ll look at use Perl. We chose Perl because it’s included with almost every Linux distribution and its regular expression syntax is a de facto standard. If you want to use another programming language (e.g., Python, Java, or Ruby) or one of the Linux text processing utilities (e.g., awk or grep), what you learn here will translate easily.

If you’ve never used Perl, don’t worry. We’ll only use a minimal subset of it. The most important Perl operator we’ll use is =~, the comparison operator, which tests a string against a regular expression. For example, let’s say we want to determine if a string contains the letter J. The Perl regular expression to use is /J/ (in Perl, a regular expression is commonly bounded by the / character). The command below shows the use of =~.

% perl -e ‘print “Match\n” if “Jeff” =~ /J/;’

This command prints Match because the letter J appears in the string Jeff.


A regular expression is a sequence of characters that describe a set of strings.

The simplest regex is a literal character, such as a. The regex a (expressed as /a/ in Perl) matches any string with an a in it.

Of course, searching for a single letter is not that interesting. You can combine literal characters to search for literal strings. For example, if you want to search for a particular string, say pizza, you’d use the regular expression /pizza/. /pizza/ means “look for a p followed by an i followed by a z followed by another z followed by an a. The string pizza matches /pizza/ as well as pizza delivery guy; Tower of Pizza does not match because P is not the same as p.

Table One: Common regular expressions literals

Table One shows some of the literals you can use in regular expressions.

The characters \, +, and * (and others) have special meanings in regular expressions. So, to match a literal +, you have to precede it with a backslash. Table One shows examples of other special literals.

The notation [], shown in Row 5 of Table One, is used to define a character class, or a set of literal characters. A character class is a shorthand for “match any one of these literal characters.” For example, the notation [aou] in Row 5 means “either a or o or u, but nothing else.” So, the regular expression /c[aou]p/ matches all strings that “have a letter c followed by an a, o, or u, followed by the letter p.” At a minimum, the strings cap, cop, and cup match the pattern; it is also found as substrings in cape and copper.

In fact, [] is used so often that certain languages have defined special notations for frequently-used character classes. For example, Perl uses the notation \d for the character class [0123456789] ([0123456789] means “any digit”).

You can also create a character class that excludes characters. For example, let’s say you want a character class for consonants. You could write it as [bcdfghjklmnpqrstvwxyz], but it’s easier to simply write [^aeiou]. The ^ (caret) used inside square brackets means “not any of …”. So, the character class [^aeiou] would match any character (not just any letter) except a, e, i, o, or u. If you use the caret, it must be the first character in the character class.

Finally, if you want to create a character class from a contiguous range of literal characters, you can use the range operator -(a hyphen). For example, the character class [0-9] is the same thing as [0123456789]. [a-z] is all lowercase characters; [A-Z] is all uppercase characters.


While all regular expressions use literals, you can’t process complex text patterns with literals alone. Consider the following problem statement:

Parse a string where individual fields are separated by commas. However, commas can appear in quoted strings; in that case, the comma shouldn’t be treated as a separator. Finally, a quote can be embedded in a quoted string by doubling it.

In other words, we want to take a string like the following: one,two,”three “”little”" monkeys”,four,”five, yes”

and break it up into the strings one, two, three “little” monkeys, four, and five, yes (the string five COMMA SPACE yes).

To solve this problem, you need to be able to express notions such as “optional”, “this or that”, “one or more”, “exactly this many”, and “not”. Like mystic runes that mean more than they seem, the regex metacharacters allow you to describe more complex patterns.

Table Two: The metacharacter groups

There are generally fourteen metacharacters: [ ] \ . | ^ $ ( ) + ? { } *. You’ve already seen [] and one use of the backslash (\) character. . (“dot”) has a special purpose. The other metacharacters are used for alternation, anchors, and grouping, and as quantifiers.

Table Two summarizes the purpose of each group of metacharacters and the following sections describe how to use each group in more detail.

Matching Anything (Without Writing Everythying)

Every now and then, you’ll want your regex to match a character — any character (except newline). Perl provides the “dot” metacharacter (.) as a shortcut for the character class [^\n].

If you want to try “dot”, run the following commands. The commands print J_ff if the string matches the pattern given in the regex.

% perl -e ‘print “J_ff\n” if “Joff” =~ /J.ff/;’
% perl -e ‘print “J_ff\n” if “Jeff” =~ /J.ff/;’
% perl -e ‘print “J_ff\n” if “Jff” =~ /J.ff/;’

Note that . matches only one literal character. For example, the regular expression m.n, matches the set of strings that have an m, one other character (that’s not a newline), then an n. Strings like man, men, m7n, and woman will match. Strings like mean and mn will not.

Quantifiers: Repweating Yourself (Without Repeating Yorself)

Imagine how difficult it would be if you had to write /[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]/ to match four letters in a row. That’s why quantifiers exist — they describe “how many.” For example, the terse regex /[a-zA-Z]{4}/ is the equivalent of /[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]/.

Table Three: The quantifier metacharaters

Table Three lists all of the valid quantifiers in Perl and shows examples of how each is used.

There are four quantifiers: +, ?, {}, and *.

  • The + quantifier means “one or more occurrences.” The Perl pattern match “Jeff” =~ /a+/ fails because there is not at least one consecutive a in Jeff. However, the comparison “Barry” =~ /a+/ is true.

  • The ? quantifier means “zero or one occurrence,” or in other words, “optional.” To match both boat and bat, for example, you can use the regex /bo?at/ — the a is entirely optional.

  • The {n,m} quantifier is very versatile and comes in three forms. First, {n} means “exactly n occurrences.” /jef{2}/ matches jeff. Second, {n,} matches at least n times; /jef{2,}/ matches jeff, jefff, jeffffff, etc. Third, {n,m} matches at least n times, and no more than m times; /jef{2,4}/ matches jeff, jefff, and jefff. (? is the same as {0,1}, and + is the same as {1,}.)

  • The * quantifier means “zero or more occurrences,” or in other words, “any number of occurrences, including no occurrences.” * often causes confusion because it can successfully match by matching zero times. “Jeff” =~ /a*/ returns true (oddly enough) because Jeff does contain at least zero consecutive “a”s (there are zero “a”s at the front of the string, right before the J). Even more surprising, though, is that “Barry” =~ /a*/ returns true, but only because there are zero “a”s at the front of that string!

It’s important to note that a quantifier cannot appear by itself — it always qualifies a literal or character class. Also, quantifiers “bind very tightly”: /ab*c/ matches an a followed by zero or more bs followed by a. For example, in the regex /boy?/ the ? applies only to y. (If you wanted ? to apply to boy, you’d have to use grouping — some examples in a bit.)

Alternation: Decisions, Decisions

It would be a real hassle to have to use more than one regular expression to determine if a string has a J or a j in it. While you could use a character class (i.e., [Jj]), you wouldn’t be able to easily express something like “does this string have the sequence c-a-t or the sequence d-o-g?”

That’s what alternation provides: it lets you say “this regex or that regex.” Alternation is specified with the | (pipe) character. For example, /J|j/ means “match a J or a j“. dog|cat means “match the sequence c-a-t or the sequence d-o-g“.

You can use more than one alternation in a regex. Take a look at the following:

% perl -e ‘print “I know you.\n”
if “Jeff” =~ /Jeff|James|Ann/;’
I know you.
% perl -e ‘print “I know you.\n”
if “James” =~ /Jeff|James|Ann/;’
I know you.
% perl -e ‘print “I know you.\n”
if “Ann” =~ /Jeff|James|Ann/;’
I know you.
% perl -e ‘print “I know you.\n”
if “ann” =~ /Jeff|James|Ann/;’

It’s important to note that the | metacharacter binds “weakly.” The regex /Ant|Man/ does not mean “match Antan or AnMan” — the | isn’t a choice between t or M. Rather, it’s a choice between the two sequences A-n-t and M-a-n.


In all our examples so far, we’ve searched for text matches anywhere in a string. For example, the simple regex /J/ looks for the letter J anywhere in a string. But what if we wanted to ensure the string started with the letter J or ended with the letter f? For those kinds of comparisons, we need to use anchors.

Just like a real anchor, regex anchors fix a position: they limit where a regex can match in a string. The two regex anchors you’ll see most often are ^ (caret) and $ (dollar sign), which match the beginning and end of a string respectively. For example, the regex /^J/ will match any string that starts with J, such as Jeff or James, but not Abdul Jamal. Likewise, /f$/ matches any string that ends with f, such as Jeff or Biff, but not Jeffrey or Buffy. Both anchors can be used together — /^Jeff$/ matches only the string Jeff.


Quantifiers are nifty, and so is alternation, but we’re still missing something.

Consider, for example, how you might solve the following problem: match J-e-f-f-r-e-y or J-e-f-f or J-a-m-e-s. Sure, /Jeffrey|Jeff|James/ works. So let’s make the problem a little harder: match J-e-f-f-r-e-y or J-e-f-f or J-a-m-e-s followed by zero or more occurrences of the sequence, SPACE-J-r-..

How can we match strings like Jeffrey and James Jr. and Jeff Jr. Jr.? Clearly, alternation only goes so far. What we really want to express is that the entire sequence SPACE-J-r-. is optional. That’s what the () (parentheses) metacharacters are for — grouping.

Grouping allow us to partition off part of the regex as a sub-pattern and treat the entire sub-pattern as as one big symbol. Using grouping parentheses, quantifiers, and alternation, you can express arbitrarily complex patterns.

To solve the problem above, use the regular expression /(Jeff(rey)?|James)( Jr\.)*/. The first regular expression, Jeff(rey)? matches Jeff or Jeffrey because the sequence r-e-y can appear once or not at all. Therefore, the first grouping, (Jeff(rey)?|James), matches Jeff, Jeffrey, or James. The sub-pattern ( Jr\.)* means “zero or more sequences of SPACE-J-r-.“. Remember that . (“dot”) is a metacharacter; to use it as a literal, you must “escape” it with the backslash.

A Practical Example

Let’s use our new regex powers to write a small application that looks for email spam. To find spam, we’ll read an email message, extract the mail headers (e.g., “From”, “To”) and look for obvious spam phrases like FREE! and BUY NOW!!!

In Linux, each incoming email message has two sections, the email headers and the body, and the sections are separated by a blank line (a line consisting solely of a newline). Header lines look like Header: Value, where Header and Value can be any string, but Header will not contain a colon. For simplicity, we’ll assume that our email messages do not have duplicate headers (such as “Received:”) or headers that span multiple lines.

Here’s a program that does the trick:

while (<STDIN>) {
last if /^$/;
my ($type,$value) = /([^:]+): +(.*)/ or next;
print “From Hotmail\n” if $type =~ /^From$/
and $value =~ /\@hotmail\.com$/;
print “SPAM!\n” if $type =~ /^Subject$/
and $value =~ /FREE|!{2,}|STRONG BUY/;

That’s a bare-bones email-header parser. There are far more robust solutions out there, but this is a simple example of how to put regexes to work.

The History of Regular Expressions

The theory behind regular expressions was formalized in the 1950s by Stephen Kleene. Kleene first used the term “regular expression” and developed formal definitions of finite state machines, the hearts of regex engines.

In Unix, regex engines have been around since the early days at Bell Labs. Early text editors provided regular expressions as a means of finding text and making changes without having to search manually. Utilities such as grep, sed, and awk appeared in the 1970s and have remained in distributions to this day — although modern versions are much easier to use than early incarnations.

Initially, grep, sed, and awk had widely varying features and different syntax, but in 1986, Henry Spencer changed all that. Spencer’s regex programming library defined a standard syntax and a standard set of features. With Spencer’s package, programmers could use regular expressions in their own applications, yet leverage the experience users already had with existing tools. Much of Spencer’s work persists to this day.

Perl used its own regex library until version 2, when it began to use a derivative of Spencer’s creation. Since then, Perl’s derivations have grown in size and power, and Perl’s regex engine is capable of much more than either Spencer and Kleene originally envisioned. Perl is the new standard and many other languages (including Java, Python, and Ruby) and utilities (modern grep and Emacs) now use PCREs (“Perl-compatible regular expressions”) — a subset of Perl’s syntax.

Behind the Curtain: Regex Engines

As we mentioned at the outset, a regular expression is a convenient shorthand for what would otherwise be some very complex pattern matching code. However, regular expressions are not magic. Embedded in every regex tool is a complex bit of software that interprets the regex and performs the pattern matching.

There are two kinds of regex engines: text-directed and regex-directed. (The terms “text-directed” and “regex-directed” were first coined in the O’Reilly book, “Mastering Regular Expressions” by Jeffery Friedl, an excellent reference for all things regex.) Both types of engines can perform pattern matches, but they work differently and can sometimes produce different results.

To understand the differences between a text-directed and a regex-directed engine, let’s see what happens when each tries to match the string the annual income with the regular expression /an(na|ts|nual)/.

The Text Directed Engine

In a text-directed engine, the process of matching a given string against a regex is controlled by the string. A text-directed engine starts at the first character in the string and makes decisions (made a match, match failed, some possibilities are no longer valid, etc.) as it sequentially processes each character in the string. And, because the text-directed engine is controlled by the string, it can track all possible “paths” through the regex simultaneously.

For example, given the string the annual income and /an(na|ts|nual)/, the text-directed engine would begin at the letter t in the string and start matching the string against the regex. Obviously, the characters t, h, e, and SPACE do not match the regex. Let’s pick up where the first match is made.

Figure One: How a “text-directed” engine compares strings

Figure One shows how the text-directed engine would progress. In the figure, the <> marker represents a pointer; the “State of String” and “State of Regex” columns show the position of the pointer in the string and regex, respectively.

After an in the string matches /an/ in the regex, a text-directed engine would try to match the next letter, n with all three branches of the /(na|ts|nual)/ regex at the same time. n does not match t so that eliminates one option. (Whenever a string does not match a possible path, that path is “short-circuited” and is no longer considered.) n does match the n in both of the remaining regular expressions; those regexes remain viable “paths” to pursue.

To a text-directed engine, the regex /ab+c/ and the regex /[a-a](bb*c|b+c)/ are the same, because the characters in the string control where the “pointer” in the regex is located.

Regex-Directed Engines

A regex-directed engine is the “opposite” of a text-directed engine. A regex directed engine starts with the regex and tries to find text that matches.

Consider our example once again. Given the regular expression /an(na|ts|nual)/ and the string the annual income, and skipping over the initial letters that do not match, the regex-directed engine starts its processing with /a/ (the first literal in the regex). /a/ matches the a in annual. Similarly, /n/ matches n.

The next step is to match the regex (na|ts|nual), which means “match n-a, or t-s or n-u-a-l, in that order”.

First, the engine tries to match /n/ and then /a/, but fails because annu does not match /anna/. Next, the engine tries /ts/. This one fails immediately because t does not match n (the second n in annual. Finally, the engine tries /nual/, and matches.

Figure Two: How a “regex-directed” engine compares strings

Figure Two shows the progress of the regex-directed engine as it processes our example.

Speed Differences

A text-directed engine and a regex-directed engine also differ in both compilation and execution speed.

A text-directed engine is slower to compile a regular expression because it maintains a rather large data structure of states — it has to know where to go for each character the regex consumes. However, because of this data structure, a text-directed engine matches very quickly. The execution speed of a text-directed engine is not related to the regex, but rather to the string it is matched against.

A regex-directed engine compiles regular expressions rather quickly, on the other hand, because it only compiles information about the regex itself, not every possible move between states. However, because of this sub-optimal compilation, the actual execution is slower than that of a text-directed engine. A regex in a regex-directed engine will try every combination of itself on the string before giving up entirely. (Consider the regex /(a+)+b/ on the string aaaac. Since the regex-directed engine doesn’t realize that there is no b in the target string, it will go through all the permuatations of (a+)+ it can on aaaa, which is quite a lot.)

Alternation Differences

The most noticeable difference between text-directed and regex-directed engines (other than the blinding speed of a text-directed engine) is the way alternation is processed.

A regex-directed engine will examine options of an alternation left-to-right; a text-directed engine doesn’t care since it compares all of the matching options of an alternation at the same time.

A text-directed engine is really in more than one place at a time. Moreover, a text-directed engine chooses the longest successful match it finds earliest in the string, whereas a regex-directed engine chooses the first successful match. This is an important difference.

Matching /jeff|jeffrey/ against the text jeffrey in an regex-directed engine will only match jeff — that option is successful, and the regex ends. A text-directed engine would match jeffrey, since it’s the longer of the two. /jeffrey|jeff/ would match jeffrey in both.

The Future

So, what does the future hold for regular expressions? Well, the Unicode specifications allow you to embed set logic into character classes. Why is this important and how is it useful? Consider the character class of all letters except vowels. In Perl, you’d have to write [b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]. That’s difficult to write.

With Perl 6, you’ll be able to use the syntax [a-zA-Z ^^[aeiouAEIOU]]. That might look noisy, but it’s easy to explain: “the characters from a to z and A to Z, except a, e, i, o, u, A, E, I, O, and U“. Consider too, all digits except 5: /[0-46-9]/, which can be [\d^^5].

You’ll also be able to calculate the intersections of character classes. If you have two variables representing classes, and you want to match the characters they share, you can use [$this&&$that]. The fact that this can be done inside a regex (rather than constructing a separate character class manually from the two) makes you do less work, and leaves the work to Perl.

Perl’s regex engine boasts the most features of any regex implementation. There are dozens of assertions, such as a “look-ahead” ((?=pattern)) that lets you match part of a regex without advancing in the string, and arbitrary code evaluation ((?{ code })) that lets you execute any block of code at a given point in a regex. These are beyond the scope of this article, but the Perl documentation has plenty of details in perlre.

Regexes will always exist in some form, whether it be cryptic (/(?>\t+)/) or simple (<grab:<tab>+>/). The need for matching text will never disappear.


A brief history of regexes: http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?regular+expression

“Mastering Regular Expressions,” Jeffrey Friedl, O’Reilly, 1997.

The Perl Mongers: http://www.perl.org

O’Reilly’s official Perl Web site: http://www.perl.com

Documentation describing Perl’s syntax and regex features

man perlre

man perlsyn

Jeff japhy” Pinyan has been programming Perl for five years. He’s an active member of the DALnet #perl channel, the http://www.perlmonks.org community, and the Perl5-Porters. He’s working on a regex book for Manning, to be completed later this year. He can be reached at japhy@pobox.com

Comments are closed.