Java Regular Expressions

Java's string processing capabilities are far better than those found in C, C++, or Visual Basic, but aren't quite as powerful as, say, Perl's. While the Java String class provides convenient methods such as indexOf(), lastIndexOf(), and startsWith(), you still have to write quite a bit of Java code to do something like break a colon-delimited record into a set of fields. Until now.

Java 2 Standard Edition (J2SE) 1.4 adds the power of Perl-like regular expressions to Java. Indeed, even the syntax of regular expressions in the new Java package is nearly identical to Perl’s. This month, we continue our survey of new J2SE 1.4 features and focus on the classes and methods in java.util.regex.

Perls of Java

To get the most out of this article, you should already be familiar with Perl’s regular expression syntax. If you’re not, see this month’s feature on regular expressions, “Hitting the Motherlode,” (pg. 32), before you continue. Having said that, let’s jump right in.

As mentioned above, Java’s regular expressions are nearly identical to the notation used in Perl. You can use all the Perl metacharacters (e.g., ^ ( ) [] $ ? *) and all the Perl character class abbreviations (. for any character except newline, \d for digits, \D for non-digits, etc.). There’s just one small change: since the \ character is the escape character in Java string literals (it introduces sequences such as \n, \t, and \uXXXX Unicode escapes), you have to write two backslashes in a Java string wherever you’d use one in Perl.

For example, the Java string for a regular expression that matches three digits is "\\d\\d\\d". To match a decimal point, the string is "\\.". And to match a single backslash, the string is "\\\\".
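As a quick check of the escaping rule, the following sketch compiles the three patterns just described. The class name EscapeDemo, its helper methods, and the sample inputs are illustrative assumptions, not part of the original listings.

```java
import java.util.regex.Pattern;

// Illustrative sketch; EscapeDemo and its helper names are hypothetical.
class EscapeDemo {
    // In Java source, "\\d" is the two-character regular expression \d.
    static boolean isThreeDigits(String s) {
        return Pattern.matches("\\d\\d\\d", s);
    }

    // "\\." is the regular expression \. which matches a literal period.
    static boolean isDot(String s) {
        return Pattern.matches("\\.", s);
    }

    // "\\\\" is the regular expression \\ which matches one literal backslash.
    static boolean isBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        System.out.println(isThreeDigits("123"));  // true
        System.out.println(isThreeDigits("12a"));  // false
        System.out.println(isDot("."));            // true
        System.out.println(isDot("x"));            // false
        System.out.println(isBackslash("\\"));     // true ("\\" is a one-character string)
    }
}
```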

The java.util.regex package introduces two new classes, Pattern and Matcher. Pattern is used to create regular expressions; Matcher performs the actual pattern recognition. Patterns are immutable (once a Pattern has been created it cannot be changed) and all of the state information required to perform a match resides in the Matcher. Hence, many Matchers can share the same Pattern.
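To illustrate the division of labor, here is a minimal sketch in which two Matchers are created from one Pattern and keep their own state. The class name SharedPatternDemo and the sample strings are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; SharedPatternDemo is a hypothetical name.
class SharedPatternDemo {
    // Compiled once; safe to reuse because a Pattern never changes.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Each call creates a fresh Matcher that holds the state for one input.
    static String firstNumber(String s) {
        Matcher m = DIGITS.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        Matcher a = DIGITS.matcher("room 101");
        Matcher b = DIGITS.matcher("no digits here");
        System.out.println(a.find());   // true
        System.out.println(b.find());   // false
        System.out.println(a.group());  // 101 (a's state is unaffected by b)
    }
}
```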

Listing One contains a sample program that demonstrates many of the features of the java.util.regex package.

Listing One: Using Java’s regular expression package – Part 1

 1  package com.linuxmag.javamatters;
 3  import java.util.regex.*;
 5  public class RegexExample {
 6    public static void main (String args[]) {
 7      String passwdLine = "a:b:c:d:e";
 8      String upperLinux = "Linux Magazine";
 9      String jumbledLinux = "lINuX mAGazInE";
11      Pattern nonDigits = Pattern.compile("\\D");
12      Pattern integer = Pattern.compile("^[+-]?\\d+$");
13      Pattern decimal = Pattern.compile("^[+-]?(?:\\d+(?:\\.\\d*)+|\\.\\d+)$");
15      if (Pattern.matches(".*Linux.*", upperLinux)) {
16        System.out.println(upperLinux + " contains the word ‘Linux’");
17      }
19      if (upperLinux.matches(".*Magazine$")) {
20        System.out.println(upperLinux + " ends with the word ‘Magazine’");
21      }
23      Pattern linux = Pattern.compile(".*(Linux).*",Pattern.CASE_INSENSITIVE);
24      Matcher m = linux.matcher(jumbledLinux);
25      if (m.matches()) {
26        System.out.println(jumbledLinux + " contains the word " + m.group(1));
27      }
29      System.out.print(passwdLine + " split by ‘:’ is … ");
30      Pattern colon = Pattern.compile(":");
31      String[] fields = colon.split(passwdLine);
33      for (int i = 0; i < fields.length; i++) {
34        System.out.print(fields[i] + " ");
35      }
36      System.out.println();
38      for (int i = 0; i < args.length; i++) {
39        String s = args[i];
41        if ((m = integer.matcher(s)).matches()) {
42          System.out.println(s + " … is an integer.");
43        }
44        if ((m = decimal.matcher(s)).matches()) {
45          System.out.println(s + " … has a decimal point.");
46        }
47        if ((m = nonDigits.matcher(s)).find()) {
48          System.out.println("find: " + s + " contains a character other than [0-9]");
49        }
50        if ((m = nonDigits.matcher(s)).lookingAt()) {
51          System.out.println("lookingAt: " + s + " is a string that begins with a non-digit");
52        }
53        if ((m = nonDigits.matcher(s)).matches()) {
54          System.out.println("matches: " + s + " is a string with only one non-digit character");
55        }
56      }
57    }
58  }

First, the sample program tests whether a string contains the word Linux (line 15) and whether the same string ends with “Magazine” (line 19). It then repeats the Linux test against a second string, this time ignoring case (line 23), and splits a colon-delimited record into fields (lines 29-36). Finally, it analyzes each command-line argument to determine if it’s an integer, a real number, or a string that contains any non-digit characters (lines 38-56).

Figure One contains the output of this Java program if you run it with the command-line arguments 5, 9.6, B, Ab, 12345a.

Figure One: Output of sample program

Linux Magazine contains the word ‘Linux’
Linux Magazine ends with the word ‘Magazine’
lINuX mAGazInE contains the word lINuX
a:b:c:d:e split by ‘:’ is … a b c d e
5 … is an integer.
9.6 … has a decimal point.
find: 9.6 contains a character other than [0-9]
find: B contains a character other than [0-9]
lookingAt: B is a string that begins with a non-digit
matches: B is a string with only one non-digit character
find: Ab contains a character other than [0-9]
lookingAt: Ab is a string that begins with a non-digit
find: 12345a contains a character other than [0-9]

Working with Patterns

To create a Pattern, we use the static method Pattern.compile(). Lines 11-13 show how to use compile(). (The Perl expressions in the Java code are based on examples in O’Reilly’s “Perl Cookbook.” The backslashes have been doubled to use them in Java.) Line 23 shows a variant of compile() that uses flags to modify how the regular expression is matched against a string. In this case, we specify Pattern.CASE_INSENSITIVE to ignore case.
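Flags can also be combined with the bitwise OR operator. The sketch below is one way to do that, assuming we want to find a line consisting only of the word “linux” in any case; the class name FlagsDemo and the sample text are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative sketch; FlagsDemo and hasLinuxLine are hypothetical names.
class FlagsDemo {
    // CASE_INSENSITIVE ignores case; MULTILINE lets ^ and $ match at
    // the start and end of each line rather than only the whole input.
    static final Pattern LINUX_LINE =
        Pattern.compile("^linux$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    static boolean hasLinuxLine(String text) {
        return LINUX_LINE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasLinuxLine("Red Hat\nLINUX\nrocks"));  // true
        System.out.println(hasLinuxLine("Linux Magazine"));         // false
    }
}
```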

Regular expressions should be compiled into a Pattern whenever they’re going to be used more than once, as in the loop in lines 38-56, because the expression is then parsed and compiled only once rather than on every use. As mentioned above, a compiled Pattern can also be shared among many Matchers.

However, if you are only going to use a regular expression once, Pattern provides a method Pattern.matches() for programming convenience. You can see an example of Pattern.matches() on line 15. Also notice line 19. In J2SE 1.4, the String class has a matches() method, too.
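The three styles can be compared side by side. The following sketch applies the same test in each style and should give the same answer every time; the class name OneShotDemo and the method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Illustrative sketch; OneShotDemo and its method names are hypothetical.
class OneShotDemo {
    // Compiled once, for the case where the test is repeated many times.
    static final Pattern ENDS_WITH_MAGAZINE = Pattern.compile(".*Magazine$");

    // One-shot: compiles the expression on every call.
    static boolean viaPattern(String s) {
        return Pattern.matches(".*Magazine$", s);
    }

    // One-shot: String.matches() does the same thing.
    static boolean viaString(String s) {
        return s.matches(".*Magazine$");
    }

    // Reusable: only a new Matcher is created per call.
    static boolean viaCompiled(String s) {
        return ENDS_WITH_MAGAZINE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String s = "Linux Magazine";
        System.out.println(viaPattern(s));   // true
        System.out.println(viaString(s));    // true
        System.out.println(viaCompiled(s));  // true
    }
}
```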

Make Me a Matcher, Find Me a Find

Once you’ve compiled a Pattern, you can apply your regular expression to source strings using the Matcher class. Matcher is the engine that compares a string (more specifically, any Java CharSequence) to a specific regular expression.

A Matcher is created from a Pattern by invoking the Pattern.matcher() method. This is shown in line 24 (also on lines 41, 44, 47, 50, and 53), where we create the Matcher m for the String variable jumbledLinux from the Pattern variable linux.

Once created, a Matcher can be used to perform three different kinds of matching: does the entire string match the pattern, does a prefix (a sequence of characters at the beginning of the string) match the pattern, or does the pattern appear somewhere in the string.

  • The matches() method returns true if the entire string matches the pattern. In effect, matches() prepends your pattern with ^ and appends your pattern with $ (if they’re not already there).

  • The lookingAt() method returns true if a prefix of the string matches the pattern. In effect, lookingAt() prepends your pattern with ^ (again, if it’s not already there).

  • The find() method scans the string looking for the next sub-sequence that matches the pattern. It starts at the beginning of the string or, if a previous invocation of the method was successful and the Matcher has not since been reset, at the first character not matched by the previous match.

Listing One includes examples of matches(), lookingAt(), and find(). You can see how the three methods differ by looking at the sample output.

  • The equivalent of Perl’s /\D/ can be found on line 47. find() looks for an occurrence of the pattern \\D anywhere in the string. The strings 9.6, B, Ab, and 12345a all match using find().

  • Line 50 looks for an occurrence of the pattern at the beginning of the string. In effect, lookingAt() is trying to match against the pattern ^\\D. The strings Ab and B match.

  • On Line 53, matches() (with the pattern \\D, the Java equivalent of Perl’s \D) does not match “12345a” because matches() is effectively comparing ^\\D$ to the entire string. However, the string “B” does match.
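The differences can also be checked in isolation. The sketch below runs all three methods against one input and then uses repeated calls to find() to walk through every match. The class name MatchModesDemo, the method names, and the output format are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; MatchModesDemo and its method names are hypothetical.
class MatchModesDemo {
    static final Pattern NON_DIGIT = Pattern.compile("\\D");

    // Reports the result of all three matching methods for one input.
    static String describe(String s) {
        Matcher m = NON_DIGIT.matcher(s);
        boolean whole = m.matches();     // the entire string must match
        boolean prefix = m.lookingAt();  // the match must start at index 0
        m.reset();                       // discard state so find() starts at index 0
        boolean anywhere = m.find();     // the match may start at any index
        return "matches=" + whole + " lookingAt=" + prefix + " find=" + anywhere;
    }

    // Calls find() repeatedly and records each match with its start index.
    static String locate(String s) {
        Matcher m = NON_DIGIT.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(m.group()).append('@').append(m.start());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("B"));       // matches=true lookingAt=true find=true
        System.out.println(describe("Ab"));      // matches=false lookingAt=true find=true
        System.out.println(describe("12345a"));  // matches=false lookingAt=false find=true
        System.out.println(locate("12a4.5b"));   // a@2 .@4 b@6
    }
}
```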

The Replacements

In addition to pattern matching, the Matcher class can also replace matched sequences with new strings. For building the result piece by piece, Matcher provides two methods: appendReplacement() and appendTail(). (It also offers replaceAll() and replaceFirst() for simple cases.) Both append methods work in a less-than-obvious way, so let’s look at an example in Listing Two.

Listing Two: Performing replacements

 1  String green = "green";
 2  String quote = "I do not like green eggs and green ham, Sam I Am";
 3  StringBuffer sb = new StringBuffer();
 4  Pattern p = Pattern.compile(green);
 5  Matcher m = p.matcher(quote);
 7  boolean result = m.find();
 8  while (result) {
 9    m.appendReplacement(sb, "blue");
10    result = m.find();
11  }
12  m.appendTail(sb);
13  System.out.println(sb.toString());

Given the string “I do not like green eggs and green ham, Sam I Am”, the code snippet in Listing Two prints “I do not like blue eggs and blue ham, Sam I Am”.

(A word of warning: when doing string replacements using Matcher, never modify the input string directly. Instead, build the new string in a separate StringBuffer.)

Lines 1-5 set things up: Matcher m is ready to look for occurrences of the string “green”. The first call to m.find() returns true (since there is an occurrence of “green” in the string). Internally, Matcher m has recorded where the match started and ended, and also where the previous match ended.

The call to appendReplacement() on line 9 does two things. First, it appends every character from the end of the previous match up to the beginning of the current match to StringBuffer sb. Second, it appends the replacement string. So, in this example, the first call to m.appendReplacement() would append “I do not like ” and “blue” to sb.

The next call to m.find() on line 10 finds the second occurrence of “green”. The subsequent call to m.appendReplacement() appends “ eggs and ” and “blue” to sb. m.find() fails the third time and terminates the loop. The call to m.appendTail() appends all of the characters from the end of the last match to the end of the string. Here it appends “ ham, Sam I Am”.

One note: using the literal regular expression “green” can cause a subtle problem. Specifically, “green” is a substring of “evergreen” and “evergreen” would be changed into “everblue” — not what we want. To avoid the unexpected, we should change the regular expression to look for the whole word “green”. You can try that as an exercise.
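One possible answer to that exercise, offered as a sketch rather than the only solution, is to wrap the word in \b word-boundary anchors. The class name WholeWordDemo and the sample sentence are assumptions; the loop mirrors Listing Two, and the second method uses Matcher’s replaceAll() for comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; WholeWordDemo and its method names are hypothetical.
class WholeWordDemo {
    // \b matches only between a word character and a non-word character,
    // so the "green" inside "evergreen" is skipped.
    static final Pattern GREEN_WORD = Pattern.compile("\\bgreen\\b");

    // Same append loop as Listing Two, with the stricter pattern.
    static String recolor(String text) {
        Matcher m = GREEN_WORD.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, "blue");
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // replaceAll() performs the same find/append loop in one call.
    static String recolorShort(String text) {
        return GREEN_WORD.matcher(text).replaceAll("blue");
    }

    public static void main(String[] args) {
        String s = "evergreen trees and green ham";
        System.out.println(recolor(s));       // evergreen trees and blue ham
        System.out.println(recolorShort(s));  // evergreen trees and blue ham
    }
}
```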

Now, Go Regular Expression Yourself

The J2SE 1.4 java.util.regex package makes pattern matching in Java a snap, and you’ll likely be pleased with this new-found power. There are two other features to mention.

Line 23 in Listing One demonstrates how you can use Perl’s groups in Java regular expressions. If a regular expression contains a group and a match is made, you can use the group() method to extract the substring that matched the group. In the example, the string “lINuX mAGazInE” matched the pattern .*(Linux).* (the match was case-insensitive); group 1 is the substring “lINuX”.
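A pattern can contain more than one group. The sketch below assumes a made-up name:number record format purely for illustration; the class name GroupDemo and the field meanings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; GroupDemo and the record format are hypothetical.
class GroupDemo {
    // Group 1 captures a run of word characters, group 2 a run of digits.
    static final Pattern FIELD = Pattern.compile("(\\w+):(\\d+)");

    // Returns a description built from the two groups, or null if no match.
    static String describe(String s) {
        Matcher m = FIELD.matcher(s);
        if (!m.matches()) {
            return null;
        }
        // group(0) is the whole match; groups are numbered from 1.
        return m.group(1) + " has uid " + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("root:0"));         // root has uid 0
        System.out.println(FIELD.matcher("").groupCount());  // 2
    }
}
```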

Lines 30-31 show another convenience method. The Pattern method split() works much like Perl’s split: given a source string and a delimiter (a regular expression), split() divides the source string at each occurrence of the delimiter, yielding an array of strings. In our example, we create Pattern colon from the literal regular expression : and then call the split() method to divide the String variable passwdLine into an array.
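A few variations on split() are sketched below: a limit argument that caps the number of fields, and a delimiter that is a real regular expression rather than a single character. The class name SplitDemo and the sample strings are assumptions for this example.

```java
import java.util.regex.Pattern;

// Illustrative sketch; SplitDemo and its method names are hypothetical.
class SplitDemo {
    static final Pattern COLON = Pattern.compile(":");

    // Splits at every colon; trailing empty fields are dropped.
    static String[] fields(String record) {
        return COLON.split(record);
    }

    // A limit of 2 yields at most two fields; the rest stays in the last one.
    static String[] firstTwo(String record) {
        return COLON.split(record, 2);
    }

    // The delimiter here is a comma with optional surrounding whitespace.
    static String[] words(String s) {
        return s.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(fields("a:b:c:d:e").length);    // 5
        System.out.println(firstTwo("a:b:c:d:e")[1]);      // b:c:d:e
        System.out.println(words("red , green,blue")[1]);  // green
    }
}
```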

You can find complete JavaDoc documentation for the java.util.regex package online at http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html.

Martin Streicher is an Executive Editor at Linux Magazine. Even more impressive, he has a complete set of mint-condition Pee Wee’s Playhouse action figures. He can be reached at mstreicher@linux-mag.com.
