What’s the diff?

Get more control over how file differences are found and displayed with some lesser-known options, and other techniques for getting the output you need.

There are good graphical file-comparison programs available. They aren’t always the best tool for comparing files, though. One reason to use the command-line diff is that it can give you much more control over how differences are found and displayed.

You’re probably familiar with diff(1). We’ll look at some details of how diff works, some lesser-known options, and other techniques for getting the output you need.

This is the first of a series about diffutils, an important package of tools.

Different diffs

You might be surprised to know that, given two files, diff may not always give the same set of differences. As an example, Table One shows two snippets of C code in files named 1.c and 2.c.

Table One: Two short files

File 1.c    File 2.c
a *= b;     c *= d;
b *= c;     b *= c;
c *= d;     a *= b;

GNU diff version 2.8.1 shows that the first two lines of 1.c were deleted, and that two new lines were added to 2.c:

$ diff 1.c 2.c
1,2d0
< a *= b;
< b *= c;
3a2,3
> b *= c;
> a *= b;

But you might prefer to think that the first and third lines were replaced, while the second line is the same. This would be a valid diff output, too:

$ hypothetical_diff 1.c 2.c
1c1
< a *= b;
---
> c *= d;
3c3
< c *= d;
---
> a *= b;

Obviously, there's more than one way to represent the differences between two files. The standard diff algorithm will usually do a good job. There are times, though, when you might want another opinion:

  • The option -d or --minimal tells diff to "try harder" to find the smallest set of changes; that makes diff slower.
  • If two large files have a few widely-spaced changes, the GNU option --speed-large-files can make diff run faster.
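As a rough check of the first option, the sketch below recreates the two files from Table One in a scratch directory and counts the changed lines (those starting with < or >) that diff reports with and without -d. The directory and variable names are only for illustration. On files this small, both runs should find an equally small set of changes; any gap is more likely to show up on large inputs.

```shell
# Work in a throwaway directory so nothing in the current one is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the files from Table One.
printf '%s\n' 'a *= b;' 'b *= c;' 'c *= d;' > 1.c
printf '%s\n' 'c *= d;' 'b *= c;' 'a *= b;' > 2.c

# Count the lines that each run reports as deleted (<) or added (>).
default_changed=$(diff 1.c 2.c | grep -c '^[<>]')
minimal_changed=$(diff -d 1.c 2.c | grep -c '^[<>]')

echo "default: $default_changed changed lines"
echo "with -d: $minimal_changed changed lines"
```

The two files share at most one line in the same relative order, so the smallest possible set of changes is two deletions plus two additions.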

Hunks, whitespace

diff works by finding groups of lines that are the same in both files. Between those groups of similar lines are hunks: groups of lines that differ.

By default, when diff is finding common lines and hunks, it looks at every character on a line. That includes whitespace characters: spaces, tabs, newlines, and carriage returns (which can come from Microsoft systems that use a CR before every NL). If diff shows a difference between two lines that look identical to you, the difference might be in the whitespace. Here are two ways to deal with that:

  1. Pipe diff's output to cat -te (which you may need to type as cat -t -e on older versions of cat). This shows TABs as ^I and marks the ends of lines with $, making it easy to see what's different in the whitespace.
  2. Use a diff option that tells it to ignore some or all differences in whitespace. GNU diff has several of these:
    • The -E option treats a TAB the same as the equivalent number of spaces.
    • Use -b to treat every sequence of whitespace the same, and also to ignore whitespace at the ends of lines.
    • The -w option ignores every whitespace character completely. (So the words out side would compare equal to outside.)
    • To ignore completely-empty lines (that is, a line that's nothing but a newline character), use -B.

    For more information, use man diff or, better, info diff.
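Here's a small sketch of both approaches, assuming GNU diff and two made-up files (spaced1.c and spaced2.c) that differ only in their whitespace. It relies on diff's exit status: 0 when no differences are found, 1 when there are some.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same text, but spaced differently: two spaces after "int" in one file,
# a trailing space at the end of the line in the other.
printf 'int  x = 1;\n' > spaced1.c
printf 'int x = 1; \n' > spaced2.c

# Plain diff reports a difference; cat -te makes the whitespace visible.
diff spaced1.c spaced2.c | cat -te

# -b treats runs of whitespace alike and ignores trailing whitespace.
diff -b spaced1.c spaced2.c
status_b=$?

# -w ignores whitespace altogether.
diff -w spaced1.c spaced2.c
status_w=$?

echo "exit status with -b: $status_b, with -w: $status_w"
```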

Ignoring case, ignoring certain lines

The -i option makes upper-case and lower-case letters compare equal. So the text "a line" in one file and "A lInE" in the other file would compare equal, and wouldn't be part of a hunk.
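A quick way to try this is with two one-line files. The file names lower.txt and mixed.txt are made up for this sketch, which again reads diff's exit status (0 for no differences, 1 for some).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'a line\n' > lower.txt
printf 'A lInE\n' > mixed.txt

# Without -i the two lines differ, so diff exits with status 1.
diff lower.txt mixed.txt > /dev/null
status_plain=$?

# With -i, case is ignored and diff finds no differences (status 0).
diff -i lower.txt mixed.txt
status_i=$?

echo "plain: $status_plain, with -i: $status_i"
```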

If there are particular lines that diff should ignore completely, describe them with a regular expression and pass it to diff with the -I option. Listing One has an example. We use cat -n to show line numbers on the two files. Without -I, diff shows changes to the headings. With -I HEAD, diff ignores changed lines that contain the string HEAD, which, in this case, are all of the headings in afile and bfile.

Listing One: Ignoring certain lines

$ cat -n afile
     1  line a
     2  line b
     3  A HEADING
     4  line c
     5  line d
$ cat -n bfile
     1  A HEADING
     2  line a
     3  line b
     4  ANOTHER HEADING
     5  line c
     6  line d
$ diff [ab]file
0a1
> A HEADING
3c4
< A HEADING
---
> ANOTHER HEADING
$ diff -I HEAD [ab]file
$

Note that -I suppresses a hunk only if every changed line in it, from both files, matches the pattern. If the pattern matches only some of those lines, diff shows the whole hunk. Here, for instance, the pattern ANOTHER matches only the line from bfile, not the A HEADING lines, so diff still outputs both hunks:

$ diff -I ANOTHER [ab]file
0a1
> A HEADING
3c4
< A HEADING
---
> ANOTHER HEADING

The argument to -I is a grep-style regular expression. For example, to ignore lines starting with an upper-case letter, use the regular expression ^[A-Z] or, more portably, the character class ^[[:upper:]]:

$ diff -I '^[[:upper:]]' [ab]file
$

You can specify multiple patterns to ignore by using multiple -I options.
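As a sketch of that, assuming GNU diff, the commands below recreate afile and bfile from Listing One and give one pattern for each heading. The patterns themselves are just one possible choice.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate afile and bfile from Listing One.
printf '%s\n' 'line a' 'line b' 'A HEADING' 'line c' 'line d' > afile
printf '%s\n' 'A HEADING' 'line a' 'line b' 'ANOTHER HEADING' 'line c' 'line d' > bfile

# One pattern per -I option; a changed line may match either of them.
ignored=$(diff -I '^A HEADING$' -I '^ANOTHER' afile bfile)

# Dropping the second pattern leaves the 3c4 hunk with an unmatched line.
shown=$(diff -I '^A HEADING$' afile bfile)

printf 'both patterns: [%s]\n' "$ignored"
printf 'one pattern:\n%s\n' "$shown"
```

With both patterns, every changed line matches one of them, so diff has nothing to report. With only the first, the ANOTHER HEADING line is unmatched and its hunk is shown.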

Ignoring line breaks

diff is line-oriented: it compares entire lines (strings ending with newline characters). If you edit a file with an editor that re-formats lines -- moving the newlines to different places -- diff will show differences where there are none (other than the newlines).

One way to solve this is by getting the diff front-end named wdiff, which was first written around 1990. It breaks lines into separate words, then compares the words with diff. wdiff also has some features that are handy for interactive browsing of differences.

If you don't want the complexity of wdiff, you can do something similar yourself by replacing all sequences of tabs and spaces with single newline characters. This puts each word on a separate line -- so diff can find groups of common words without being confused by the differences in newline characters. The command tr -s '\t ' '[\n*]' does the job. (Note the space after \t in the first argument.)

Listing Two shows an example. file1 has four words, and cat -te file1 shows that there are both spaces and TABs between the words. (Word processors may add multiple spaces between words. We've tossed in some TABs for good measure.) Note that tr can't accept a filename as an argument; you have to use the shell's < redirection operator to take tr's standard input from that file. The output of tr has one word per line.

Listing Two: Showing word differences

zsh% cat file1
this    is      a
test
zsh% cat -te file1
this    is^Ia ^I $
test$
zsh% tr -s '\t ' '[\n*]' <file1
this
is
a
test
zsh% cat file2
this is
a test too
zsh% diff =(tr -s '\t ' '[\n*]' <file1) =(tr -s '\t ' '[\n*]' <file2)
4a5
> too

We're using Z shell process substitution to run tr on both file1 and file2, then pass the results to diff as the names of two temporary files. (The article Catching Some ZZZs, available online at http://www.linux-mag.com/id/1579, introduces process substitution in the Z Shell. Bash process substitution is similar, but it doesn't have the zsh operator =(), which writes the output to a real temporary file; we might need that in case diff tries to rewind its file inputs.)

You could also run tr twice, saving its output in two temporary files, then compare those files:

$ tr -s '\t ' '[\n*]' <file1 > temp1
$ tr -s '\t ' '[\n*]' <file2 > temp2
$ diff temp[12]
4a5
> too
$ rm temp[12]
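If you do this often, the temporary-file steps can be wrapped in a shell function. The sketch below is one way to do it in a Bourne-style shell; the name word_diff is made up here and isn't part of diffutils. It recreates file1 and file2 from Listing Two to try the function out.

```shell
# word_diff is a hypothetical helper name, not part of diffutils.
# It puts each word of its two file arguments on a separate line,
# compares the results with diff, then removes its temporary files.
word_diff() {
    wd_tmp1=$(mktemp) || return 2
    wd_tmp2=$(mktemp) || { rm -f "$wd_tmp1"; return 2; }
    tr -s '\t ' '[\n*]' < "$1" > "$wd_tmp1"
    tr -s '\t ' '[\n*]' < "$2" > "$wd_tmp2"
    diff "$wd_tmp1" "$wd_tmp2"
    wd_status=$?
    rm -f "$wd_tmp1" "$wd_tmp2"
    return "$wd_status"
}

# Recreate file1 and file2 from Listing Two in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'this    is\ta \t \ntest\n' > file1
printf 'this is\na test too\n' > file2

word_result=$(word_diff file1 file2)
printf '%s\n' "$word_result"
```

The function returns diff's own exit status, so it can be used in shell tests the same way diff can.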

Custom comparisons

You're a judge in the International Obfuscated C Code Contest (http://www.ioccc.org/). You'd like to see the differences between two versions of a program that's written as "ASCII art," in the shape of a smiley face.

"Bits and pieces: Comparing binary data (and more)" showed the bdiff script. It compares binary files character-by-character using od to show visible representations, one character per line -- and diff to compare them. For text files, we don't need od.

Since whitespace generally doesn't matter in C programming, you use a little sed script to put each character on a separate line, then remove all spaces and TABs. Listing Three shows the script named sedscr. The first command spans two lines. The square brackets contain a space and a TAB. You run both files through sed, then use the unified-format diff option -u to show three characters of context around each hunk:

Listing Three: Un-obfuscated diff

$ cat sedscr
s/./&\
/g
s/[     ]//g

$ sed -f sedscr old.c > old.tmp
$ sed -f sedscr new.c > new.tmp
$ diff -u old.tmp new.tmp
--- old.tmp  ...
+++ new.tmp  ...
@@ -731,6 +731,7 @@
 ?
 9
 1
+0
 <
 x
 :

Somewhere around the 734th character (line 734 of the temporary files), the new version of the file has an added 0.

You might have done better with a sed script that breaks text into separate lines at semicolons (;) -- the C statement separator -- and also removes whitespace. Once you see the technique, though, you can tweak the sed script, or use a different method, to eliminate text that obfuscates diff -- and find the real differences.
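Here's a minimal sketch of that statement-per-line idea. It swaps in tr for sed: one tr deletes all spaces, TABs, and newlines, and a second turns each semicolon into a newline. The function name c_statements and the two tiny input files are made up for illustration. It's a blunt tool, since it also splits at semicolons inside for loops and string literals.

```shell
# c_statements is a hypothetical helper: it deletes all spaces, TABs and
# newlines, then turns each semicolon into a newline. The result is only
# meant for comparison; it is no longer valid C.
c_statements() {
    tr -d ' \t\n' < "$1" | tr ';' '\n'
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up files: same statements, different layout, one real change.
printf 'a = 1 ;  b = 91 < x ;\n c = 3;\n' > old.c
printf 'a=1;\nb = 910 < x; c = 3;\n' > new.c

c_statements old.c > old.tmp
c_statements new.c > new.tmp
stmt_diff=$(diff old.tmp new.tmp)
printf '%s\n' "$stmt_diff"
```

Only the second statement should be reported as changed; the differences in layout disappear.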

Which function or section?

If a program file contains multiple functions, it can be useful to know which function has changed. The GNU diff option -p uses a simple heuristic for C-like code: it treats a line that starts with a letter, an underscore, or a dollar sign as the start of a function. When it finds a change, it outputs the most recent such line at the start of the difference listing. For example (using -C 0 to suppress context output, to save printing space):

$ diff -C 0 -p prog[12].c
*** prog1.c     ...
--- prog2.c     ...
*************** int calc (int a) {
*** 48 ****
!       c *= c;
--- 48 ----
!       b *= c;
*************** void do_output() {
*** 120 ****
!       fprintf (stderr, "OUTPTU:");
--- 120 ----
!       fprintf (stderr, "OUTPUT:");
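To try -p yourself, here's a small sketch that assumes GNU diff. The two-function file is made up for the purpose, and prog2.c is generated from prog1.c with a one-line change.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up versions of a file with two functions; only second() changes.
cat > prog1.c <<'EOF'
int first(void)
{
    return 1;
}

int second(void)
{
    return 2;
}
EOF
sed 's/return 2;/return 3;/' prog1.c > prog2.c

# -p labels each hunk with the nearest preceding line that looks like
# the start of a function.
p_output=$(diff -C 0 -p prog1.c prog2.c)
printf '%s\n' "$p_output"
```

The row of asterisks that introduces the hunk should end with int second(void), the last line before the change that starts with a letter.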

If you aren't comparing C code, the option -F regexp uses the most recent line matching regexp as the heading. For instance, in HTML code that has header lines tagged with <h1>, <h2>, and so on:

$ diff -C 0 -F '^<[hH][1-5]>' a.html b.html
*** a.html      ...
--- b.html      ...
*************** <h1>Summary</h1>
*** 19 ****
! <li>Yo-yos and pop gums,</li>
--- 19 ----
! <li>Yo-yos and pop guns,</li>
*************** <h3>Pop Guns</h3>
*** 47 ****
! Large grene $1.49
--- 47 ----
! Large green $1.49
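The same idea works for other kinds of files. As a sketch, assuming GNU diff, here's -F applied to a made-up INI-style file, where section names in square brackets serve as the headings.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A made-up INI-style file; section names in square brackets act as headings.
cat > old.ini <<'EOF'
[display]
width = 80
height = 24
[network]
host = example.org
port = 8080
EOF
sed 's/port = 8080/port = 8081/' old.ini > new.ini

# -F shows the nearest preceding line that matches the regular expression.
f_output=$(diff -C 0 -F '^\[' old.ini new.ini)
printf '%s\n' "$f_output"
```

The changed port line is in the [network] section, so that's the heading diff should print next to the hunk.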