Finding Similar Images

I admit it. Like anyone else with a decent-speed connection to the Internet, I collect a lot of images. For example, a few months ago, I described a program that looks through Yahoo! news images for pictures of Oregon and some of my favorite singing stars. Sometimes, an image travels multiple paths before it ends up on my disk, and thus gets saved under different names. But that's a waste of disk space, so I want to eliminate duplicates where I can.

At first, I was using a simple tool to compute the MD5 hash of each image in my collection. That technique easily eliminated exact duplicates, but frequently, the same image was different enough to confound the tool. For example, if the original image had been scaled, re-rendered at a different JPEG quality, or converted from JPEG to PNG, the actual bits are different and the MD5 hashes differ, even though the image is nearly the same on my screen.

So, a few days ago, I set out to write a program that could find similar images, not just identical images. My strategy is to reduce each image to a 4-by-4 grid of RGB values, yielding a 48-number vector of values from 0 to 255. Regardless of the re-rendering or resizing of the image (or even minor touchups), the vector should be identical (or close) for images with the same original source. After a few hours of experimentation and tweaking, the results of my work can be seen in Listing One, below.

Lines 1 through 3 start nearly every program I write: they enable warnings and compiler restrictions, and disable the buffering of STDOUT.

Line 5 pulls in the move routine defined in the core File::Copy module. I’ll be using this to rename any corrupt images, an interesting outcome of having to scan them anyway.

Line 7 pulls in the CPAN-located Image::Magick interface, also known as PerlMagick. I’m not sure why I have such a hard time installing and using image libraries, but Image::Magick is certainly a prime example of a finicky-to-install and vastly underdocumented image manipulation module. However, when you get it to work, it is indeed “Magick.”

Line 8 pulls in the Cache::FileCache module, part of the Cache::Cache distribution found in the CPAN. Because the conversion of an image to its vector can take time, I cache the results. The cache key is computed as something that won’t change even if I rename the item on the disk, which makes it easy for me to move my images around or give them more meaningful names without losing the work done on the image.

Line 10 predeclares the warnif() subroutine, used in the middle of the program, but not defined until the end, so that I can use it without parens.

Lines 12 to 15 collect my “likely to change” constants.

Line 13 defines the fuzz factor. If the average absolute difference between each of the corresponding 48 numbers in each of the vectors of the two images doesn’t exceed this value, it’s a match. I found 5 to be a nice compromise between too many false positives (two similar images being declared identical) and too many false negatives (not seeing a pair when the pair was there).

If the value of $CORRUPT in line 14 is defined, then a directory by that name is created if necessary, and any corrupt image (according to Image::Magick) is renamed into that directory as it's seen. If set to undef instead, the image is merely ignored (with a warning).

Lines 17 to 20 define a file-based cache, located below the invoker’s home directory. I use the glob operator to expand ~/.filecache to its full name. The glob is used in a literal slice to extract the first of what is hopefully only one response value. (There are about a dozen other ways of doing this, but this one worked the first time I tried it, and should hopefully be sufficiently portable.)

Line 22 defines the array holding the “buckets.” Each bucket contains an arrayref. The first element of that referenced array is itself an arrayref of the 48-integer vector for all images in that bucket. The remaining elements are the filenames.

The bucket strategy is a simple linear search: as each new file is examined, its vector is compared to the vectors of all previously computed buckets. If there’s a match, the name is added to the end of the list. If there’s no match, a new bucket is added to the end, with the unmatched vector and, initially, the one filename for that vector. While I first imagined this to be hideously slow (it’s an O(N²) algorithm, I think), in practice I found that I could identify matching images of an 8,000-image test directory in about two or three CPU minutes on my laptop. That’s “fast enough,” as they say.
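For illustration, here is a minimal Python sketch of the same bucket strategy; the names `close_enough` and `assign_buckets` are mine, not the column's:

```python
FUZZ = 5  # permitted average deviation per vector element

def close_enough(a, b, fuzz=FUZZ):
    """True when the summed absolute difference stays within fuzz per element."""
    return sum(abs(x - y) for x, y in zip(a, b)) <= fuzz * len(a)

def assign_buckets(items, fuzz=FUZZ):
    """items: iterable of (name, vector). Returns buckets of [vector, name, ...]."""
    buckets = []
    for name, vector in items:
        for bucket in buckets:
            # As in the column, only the bucket's first vector is consulted.
            if close_enough(bucket[0], vector, fuzz):
                bucket.append(name)
                break
        else:  # no bucket matched: start a new one
            buckets.append([vector, name])
    return buckets
```

The `for...else` plays the role of falling off the end of the bucket scan without a match.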

Lines 25 to 85 form the main per-file loop. Because I had to pop out of nested loops, I gave this loop a label of FILE. It’s good to name loops based on the noun of the thing being processed, so the shorthand of next FILE reads nicely. The loop keeps reading @ARGV for as long as elements remain, peeling them off one at a time into $file in line 26.

Lines 27 to 33 permit the list of names to include directories, which if found, are recursively processed. If a directory name is seen, then the directory is opened, and the contents are examined. Every file beginning with a dot is discarded, and the remaining elements have the directory name prepended to them. By unshifting the value back into @ARGV, we get a depth-first scan of the directories.

While I probably could have used File::Find here, I decided to open-code the steps instead. I had already gotten the program to work for all of the files specified on the command line, but I wanted to test the program with even more files than are permitted on a command line. By adding the code to replace any directory with its contents, I could just use . as one of the entries instead, and it seemed to work as needed.
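The same worklist trick can be sketched in Python (the helper name is mine): directories found in the queue are replaced by their non-dot contents, prepended so the scan runs depth-first.

```python
import os

def expand_worklist(queue):
    """Yield plain-file paths, expanding any directories in place, depth-first."""
    while queue:
        path = queue.pop(0)
        if os.path.isdir(path):
            entries = [os.path.join(path, e)
                       for e in sorted(os.listdir(path))
                       if not e.startswith(".")]       # skip dot files
            queue[:0] = entries                        # like unshift @ARGV
        elif os.path.isfile(path):
            yield path
```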

Line 35 ensures that I process only plain files. I test against the magic underscore filehandle, which contains a cache of the information on the previous test (line 27) to avoid making redundant system calls for information.

Lines 37 to 39 compute the cache key for this file, a value that remains constant even if the file is renamed. The dev/ino pair of numbers uniquely define the specific Unix inode on the specific disk on which the file is located, which doesn’t change as long as the file is merely renamed. And the modification timestamp also remains constant if the file is renamed, but will change if the contents are somehow altered. Thus, the triple of dev/ino/mtime is a useful way to track a particular version of an item, even if it’s been renamed.
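For comparison, a Python sketch of the same rename-proof key; `os.stat` exposes the identical dev/ino/mtime triple (the helper name is mine):

```python
import os

def cache_key(path):
    """Device, inode, and mtime identify this version of a file by any name."""
    st = os.stat(path)
    return f"{st.st_dev}/{st.st_ino}/{int(st.st_mtime)}"
```

Renaming the file leaves all three numbers alone; rewriting its contents bumps the mtime and thus invalidates the cached vector.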

Line 41 defines the array to hold the 48-element vector for the image, while line 43 provides a label for tracing the operation.

Lines 44 to 46 try to get the vector simply by looking at the cache. If that’s possible, then we don’t have any hard work to do for this particular image.

Otherwise, starting in line 48, the vector is computed. First, an Image::Magick object is created (line 48) and then the file is read in as an image (line 49). If there’s an error, we decide if the file needs to be renamed (line 50). If the error string contains corrupt or unexpected end-of-file (the two cases I saw most frequently), then lines 51 to 54 move the file into the designated directory, keeping its original image name, but discarding the source directory. Yes, this would be a problem if I were scanning a large hierarchy, but then again, the image is already corrupt, so it’s probably no big deal. If there’s any other error, the file is merely skipped and noted (line 56).

Once the image object is created, we show some statistics in line 62 about the image size. Lines 63 through 65 normalize the image (adjusting the brightness and contrast for maximum range), change the image to a 4×4 grid, and then set the image format (“magick”) to rgb, which permits us to extract the RGB triples in line 66. ImageToBlob returns a 48-byte string that subsequently gets expanded by the unpack operation into a series of values from 0 to 255. Line 67 stores this vector into the cache for later extraction.
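Perl's unpack "C*" simply expands the raw blob into unsigned byte values; in Python, the equivalent is iterating the bytes object (the blob here is a stand-in, not real image data):

```python
# A 4x4 RGB image yields 4 * 4 * 3 = 48 raw bytes, one per channel value.
blob = bytes([255, 0, 128]) * 16   # stand-in for ImageToBlob()'s 48-byte string
vector = list(blob)                # unpack "C*" equivalent: integers 0..255
```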

Lines 71 to 82 implement the bucket matching algorithm described earlier. First, a linear scan is made of all the existing buckets using the loop starting in line 71. Line 72 defines the accumulated error as we look at pairs of values from the two vectors being compared. Line 74 walks an index value through the list of 0 through 47, so we can examine each pair. Line 75 increases the error sum by the next absolute difference. If the sum exceeds the total permitted, according to the fuzz factor, then we bail and go on to the next bucket. Note that this means that for many vector comparisons, we can abort very early, as soon as the sum of the differences exceeds 240 units (5 times 48). Thus, if the upper-left pixel of the new image is bright red, but the upper-left pixel of the image being compared is black, the difference is already 255, and we abort after one comparison. I think this is the part that allows me to run the 8,000 images in reasonable time.
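The early-abort comparison can be sketched like this (a minimal sketch; `within_budget` is my name for it):

```python
def within_budget(a, b, fuzz=5):
    """Stop summing as soon as the running error exceeds fuzz * length."""
    budget = fuzz * len(a)          # 5 * 48 = 240 for these vectors
    error = 0
    for x, y in zip(a, b):
        error += abs(x - y)
        if error > budget:
            return False            # bail out early, like `next BUCKET`
    return True
```

Grossly different images are rejected after very few elements, which is what keeps the quadratic scan affordable.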

If a given image matches “close enough” to a particular bucket, we end up at line 79. We add the filename to the end of the bucket list, and then report that this image has been identified as similar enough, giving a list of all previous matching names. Note that this is not necessarily the final list, because later images may also match this bucket. Also note that this set of comparisons isn’t quite right, because we use the vector of only the first image of a bucket for all remaining comparisons, even though there’s a slim possibility that a new image might have matched one of the other bucket members instead. Again, this is still close enough and fast enough for my purposes.

If the bucket list is completely scanned, but we still haven’t found a match, we fall through to line 84, which simply adds a new bucket with the current vector and filename.

Lines 87 to 105 handle the interesting task of displaying the similar images, letting me choose which (if any) to delete. First, the bucket is turned into an array in line 88, and the vector is discarded in line 89. If there’s only one item in the bucket, we skip it in line 90.

Lines 91 to 94 create a montage of the images. First, a container image is created, then loaded with all of the images using their names in line 92. Then a montage is created that scales each image to within a 400 by 400 thumbnail, annotated below by the image number (in brackets), the image file name, the width and height, and the number of bytes of the image file size (often an indication of differing JPEG quality).

Lines 95 and 96 display this montage onto my currently selected X11 server. Usually, I just need to glance at the images, determine if they are the same, and if so, which one (or many) needs to be deleted by noting the image number in the brackets. I then press space to dismiss this image window, and am presented with a prompt in line 97. If I want to delete any of the values, I enter the integer value there, or just press return if I don’t want to delete anything. I can enter more than one value by using spaces or commas as a delimiter, because line 99 rips out any consecutive digit sequence as one of the values. The grep keeps the integers from being out of range (I once typed 0 and deleted the wrong image by mistake).
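A Python sketch of the reply parsing: `re.findall` plays the part of the `/(\d+)/g` match, and the comprehension the part of the grep (the function name is mine):

```python
import re

def parse_selection(reply, count):
    """Return the 1-based selections found in reply, dropping out-of-range values."""
    return [n for n in map(int, re.findall(r"\d+", reply))
            if 1 <= n <= count]
```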

Lines 100 to 104 take this list of integers and get the corresponding filenames back from the image object. Note that I don’t use the original @names array, because I’m paranoid that maybe Image::Magick might have skipped over some files that have disappeared or become unreadable between the time that I scanned them and put them into the bucket and the time that I’m now displaying their montage. Line 102 provides a trace of the action, and line 103 does the deed, blowing away that redundant waste of disk space.

And there you have it. Lots of stuff going on there, but simple in its design, and yet powerful and customizable. And now my disk is getting cleaner, because I can remove all of those redundant images. Until next time, enjoy!

Listing One: A Perl script to find similar images

1 #!/usr/bin/perl -w
2 use strict;
3 $|++;
5 use File::Copy qw(move);
7 use Image::Magick;
8 use Cache::FileCache;
10 sub warnif;
12 ## config
13 my $FUZZ = 5; # permitted average deviation in the vector elements
14 my $CORRUPT = "CORRUPT"; # if defined, rename corrupt images into this dir
15 ## end config
17 my $cache = Cache::FileCache->new({
18   namespace => 'findimagedupes',
19   cache_root => (glob("~/.filecache"))[0],
20 });
22 my @buckets;
24 FILE:
25 while (@ARGV) {
26   my $file = shift;
27   if (-d $file) {
28     opendir DIR, $file or next FILE;
29     unshift @ARGV, map {
30       /^\./ ? () : "$file/$_";
31     } sort readdir DIR;
32     next FILE;
33   }
35   next FILE unless -f _;
37   my (@stat) = stat(_) or die "should not happen: $!";
39   my $key = "@stat[0, 1, 9]"; # dev/ino/mtime
41   my @vector;
43   print "$file ";
44   if (my $data = $cache->get($key)) {
45     print "... is cached\n";
46     @vector = @$data;
47   } else {
48     my $image = Image::Magick->new;
49     if (my $x = $image->Read($file)) {
50       if (defined $CORRUPT and $x =~ /corrupt|unexpected end-of-file/i) {
51         print "... renaming into $CORRUPT\n";
52         -d $CORRUPT or mkdir $CORRUPT, 0755 or
53           die "Cannot mkdir $CORRUPT: $!";
54         move $file, $CORRUPT or warn "Cannot rename: $!";
55       } else {
56         print "... skipping ($x)\n";
57       }
59       next FILE;
60     }
62     print "is ", join("x", $image->Get('width', 'height')), "\n";
63     warnif $image->Normalize();
64     warnif $image->Resize(geometry => '4x4!');
65     warnif $image->Set(magick => 'rgb');
66     @vector = unpack "C*", $image->ImageToBlob();
67     $cache->set($key, [@vector]);
68   }
70   BUCKET:
71   for my $bucket (@buckets) {
72     my $error = 0;
74     for my $index (0..$#vector) {
75       $error += abs($bucket->[0][$index] - $vector[$index]);
76       next BUCKET if $error > $FUZZ * @vector;
77     }
79     push @$bucket, $file;
80     print "linked ", join(", ", @$bucket[1..$#$bucket]), "\n";
81     next FILE;
82   }
84   push @buckets, [[@vector], $file];
85 }
87 for my $bucket (@buckets) {
88   my @names = @$bucket;
89   shift @names; # first element is vector
90   next unless @names > 1; # skip unique images
91   my $images = Image::Magick->new;
92   $images->Read(@names);
93   my $montage = $images->Montage(geometry => '400x400',
94     label => "[%p] %i %wx%h %b");
95   print "processing...\n";
96   $montage->Display();
97   print "Delete? [none] ";
99   my @dead = grep { $_ >= 1 and $_ <= @$images } <STDIN> =~ /(\d+)/g;
100  for (@dead) {
101    my $dead_name = $images->[$_ - 1]->Get('base-filename');
102    warn "rm $dead_name\n";
103    unlink $dead_name or warn "Cannot rm $dead_name: $!";
104  }
105 }
107 use Carp qw(carp);
108 sub warnif {
109   my $value = shift;
110   carp $value if $value;
111 }

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com. You can download the code described in this column from http://www.linuxmagazine.com/downloads/2003-08/perl.



Linux Magazine, August 2003 / Perl of Wisdom