I admit it. Like anyone else with a decent-speed connection to the Internet, I collect a lot of images. For example, a few months ago, I described a program that looks through Yahoo! news images for pictures of Oregon and some of my favorite singing stars. Sometimes, an image travels multiple paths before it ends up on my disk, and thus gets saved under different names. But that’s a waste of disk space, so I want to eliminate duplicates where I can.
At first, I was using a simple tool to compute the MD5 hash of each image in my collection. That technique easily eliminated exact duplicates, but frequently, two copies of the same image were different enough to confound the tool. For example, if the original image has been scaled, re-rendered at a different JPEG quality, or converted from JPEG to PNG, the actual bits differ and so do the MD5 hashes, even though the image looks nearly the same on my screen.
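To make the exact-duplicate approach concrete, here is a minimal sketch in Python. It is not the tool mentioned above; the function names `md5_of_file` and `find_exact_duplicates` are hypothetical, and the sketch simply assumes that every file under a directory should be compared.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large images need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(directory):
    """Group files under `directory` whose contents are byte-for-byte identical.

    Returns a list of groups; each group is a list of two or more paths
    that share the same MD5 digest.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            by_hash[md5_of_file(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Because the digest depends on every byte, a single changed pixel or a different file format puts two copies in different groups, which is exactly the limitation described above.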
So, a few days ago, I set out to write a program that could find similar images, not just identical images. My strategy is to reduce each image to a 4-by-4 grid of RGB values, yielding a 48-number vector of values from 0 to 255. Regardless of the re-rendering or resizing of the image (or even minor touchups), the vector should be identical (or close) for images with the same original source. After a few hours of experimentation and tweaking, the results of my work…
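The strategy above can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the program from this column: the image is assumed to be already decoded into rows of `(r, g, b)` tuples, the 4-by-4 reduction is done by plain block averaging (the column does not say how the reduction is performed), and the names `image_signature`, `is_similar`, and the `tolerance` of 10 are hypothetical.

```python
def image_signature(pixels, grid=4):
    """Reduce an image to grid*grid RGB averages (48 numbers when grid is 4).

    `pixels` is a list of rows, each row a list of (r, g, b) tuples with
    values from 0 to 255. Block averaging is an assumption of this sketch.
    """
    height = len(pixels)
    width = len(pixels[0])
    signature = []
    for gy in range(grid):
        # Row range covered by this grid cell (at least one row).
        y0 = gy * height // grid
        y1 = max((gy + 1) * height // grid, y0 + 1)
        for gx in range(grid):
            # Column range covered by this grid cell (at least one column).
            x0 = gx * width // grid
            x1 = max((gx + 1) * width // grid, x0 + 1)
            count = (y1 - y0) * (x1 - x0)
            for channel in range(3):
                total = sum(
                    pixels[y][x][channel]
                    for y in range(y0, y1)
                    for x in range(x0, x1)
                )
                signature.append(total // count)
    return signature


def is_similar(sig_a, sig_b, tolerance=10):
    """Treat two signatures as matching if every value is within `tolerance`.

    The tolerance of 10 is an arbitrary illustrative value.
    """
    return all(abs(a - b) <= tolerance for a, b in zip(sig_a, sig_b))
```

With this reduction, an image and a copy scaled to twice its size produce the same 48 numbers, and a slightly brightened copy lands within the tolerance, while an unrelated image does not. How strict the comparison should be is a tuning decision.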