Rod Smith covers the optical character recognition (OCR) options for Linux, their limitations, and how to install and use Tesseract for your OCR needs on Linux.
Computers are excellent number-crunching machines, but they’ve traditionally been very poor at dealing with the “fuzzier” everyday world at which humans excel. Ask a computer to add a thousand numbers and it wouldn’t blink an eye if it had one; however, ask a computer to read those thousand numbers from a sheet of paper and you’ll run into problems. Even with a scanner attached to the computer, a computer will have a hard time recognizing printed numbers (or, generalizing a bit, letters and punctuation) for what they are — a task that even kindergarten children can master.
The software that attempts to teach computers about the printed alphabet and words is known as Optical Character Recognition (OCR) software. This broad class of software accepts as input a graphics file containing a scan of a printed page and outputs a text file. The input file format can, in principle, be anything — it can be in PNM, TIFF, or some other format.
In some cases the OCR software can use a scanner directly, bypassing the need to store a file on disk. Similarly, the output file format can be just about anything — a plain-text ASCII file, a word processor document, a PDF, or what have you. In any case, the challenge is the same: Giving the computer the ability to recognize individual letters. This, rather than the mundane challenge of reading or writing particular computer file formats, is the challenge of OCR.
OCR Software Capabilities and Limitations
OCR can be very tough; for instance, the difference between a lower-case L (l) and a digit one (1) is very small. Scans are seldom perfect, which complicates the comparison; even the same letter will consist of a different bit pattern at different points on the page. Add to the challenge by requiring the software to recognize text in a variety of fonts and you can begin to see why this task, which is so easy for humans, is a difficult one for computers to master.
The difficulty in correctly identifying individual characters or groups of characters is largely a technical challenge, and one that OCR developers are working to resolve. In fact, the best of today’s OCR software does reasonably well in this respect; the better software can achieve over 95% accuracy, at least when it’s not challenged by poor scans, bizarre fonts, or otherwise substandard input. Of course, even a 5% error rate (1 in 20 characters) can be unacceptable for some purposes; for instance, consider that this equates to four errors per line in an 80-character-per-line text. Still, it’s probably easier to proofread a text and change 5% of its characters than to manually re-type the document.
OCR software often has limitations other than the raw accuracy rate, though. For instance, one of the better Linux programs, Tesseract, accepts only uncompressed TIFFs as input and generates only Unicode text files as output. Tesseract also doesn’t recognize multi-column formatting or embedded figures and will therefore output scrambled (but correct) text from multi-column scans and gibberish in place of embedded graphics. To use a program such as Tesseract on such scans, you must manually edit the scanned images to reduce them to single columns of text — a process that, combined with less-than-ideal accuracy rates, may make the process more work than it’s worth.
Some programs aren’t as limited in handling advanced formatting. Although some still require setting options manually to tell the software how many columns exist, or where graphics are located on the page, they can at least handle these features. Unfortunately, the Linux OCR packages with the best character recognition are, to the best of my knowledge, the weakest at handling formatting, so you have to make a choice: high accuracy or easy OCR from complex layouts.
An Overview of Linux OCR Options
What are the Linux OCR options, though? A quick search of the Web reveals several. The better, or higher-profile, options for Linux include:
Tesseract (http://code.google.com/p/tesseract-ocr/) — This program, as just noted, is among the most accurate available for Linux. It is, however, hampered by limited input and output options. If you need to scan plain text with relatively straightforward formatting, though, it’s an excellent option.
Ocropus (http://code.google.com/p/ocropus/) — This package is built on top of Tesseract, and provides more advanced layout handling and other features that Tesseract lacks. It’s currently in very early stages of development, though, so you should only try it if you want to live on the “bleeding edge” of the Linux OCR world.
gscan2pdf (http://gscan2pdf.sourceforge.net) — This program’s claim to fame is that it scans directly from a scanner and outputs directly to a PDF. This combination is useful if you happen to want to produce a PDF from a printed document you have in hand. However, this limits your proofreading and correction options.
GNU Ocrad (http://www.gnu.org/software/ocrad/ocrad.html) — This program is similar to Tesseract in overall capabilities, although it takes PBM, PGM, or PPM files rather than TIFF files and it includes an analyzer to handle multi-column layouts. In various tests, it hasn’t fared as well as Tesseract in overall accuracy.
GOCR (http://jocr.sourceforge.net) — GOCR is an OCR engine that’s intended to be used by other programs. It can be used as a stand-alone OCR program, though.
Clara OCR (http://freshmeat.net/projects/claraocr/) — This OCR package is intended for use in distributed OCR of old books. It includes both standalone and Web-based tools to aid in this goal.
VueScan (http://www.hamrick.com) — This shareware program is primarily a tool for scanning photos (including both prints and transparencies), but it incorporates a good OCR engine as well. In my informal tests, VueScan’s OCR seemed about as accurate as Tesseract.
This list is not comprehensive; a simple Web search will reveal many other Linux OCR packages. The wealth of options means I can’t cover them all, so I’ve chosen to describe Tesseract in greater depth, because it’s currently the best open source Linux OCR software in terms of accuracy. Its limited layout and graphics support is a drawback, but if you need OCR to convert simple text documents into plain text, Tesseract is the best tool for the job, to the best of my knowledge.
Although Tesseract is usable, it’s not yet included in most distributions. So, chances are you’ll need to download it from its Google Code page. Be sure to obtain the main Tesseract source code file (tesseract-2.01.tar.gz as I write) and the file(s) for whatever language(s) you intend to scan (such as tesseract-2.00.eng.tar.gz for English). As of Tesseract 2.01, eight languages are supported: Dutch, English, Fraktur (Old German), French, German, Italian, Portuguese, and Spanish. Tesseract ships with tools to assist in the creation of support files for other languages, but creating these files will require substantial additional work.
Compilation and installation of the main Tesseract package is fairly conventional:
1. Unpack the source code.
2. Enter the main source code directory.
3. Type ./configure to configure the software.
4. Type make to compile the software.
5. Type make install to install the software.
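The five steps above can be sketched as a single shell session; the tarball name matches the release mentioned in the text and may differ for newer versions. The guard makes this a no-op if the source archive is not present.

```shell
# Build and install Tesseract from source (version assumed from the text).
if [ -f tesseract-2.01.tar.gz ]; then
    tar xzf tesseract-2.01.tar.gz          # step 1: unpack the source code
    ( cd tesseract-2.01 &&                 # step 2: enter the source tree
      ./configure &&                       # step 3: configure the software
      make &&                              # step 4: compile
      make install )                       # step 5: install (usually as root)
fi
```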
This procedure will install Tesseract, but not its language-specific data files, which you must download separately from Tesseract itself, as noted earlier. Unfortunately, Tesseract is a bit peculiar in that it requires read/write access to its language data files. This leaves you with three options for how to proceed with installation and use of Tesseract:
1. You may install the language files in a system directory (such as /usr/local/share/tessdata) and run Tesseract only as root. This option has obvious negative security implications, so I don’t recommend it.
2. You may install the language files in a system directory, as in option 1, but modify the permissions on the directory and its files so that one or more users may write to the files.
3. All Tesseract users may install copies of the language files in their own home directories. If users unpack the tesseract-2.00.eng.tar.gz file in their home directories, the result will be a ~/tessdata subdirectory in each home directory. Users may then set the TESSDATA_PREFIX environment variable to point to their home directories; Tesseract will look in the tessdata subdirectory of the directory specified by TESSDATA_PREFIX to locate language files. Note that the TESSDATA_PREFIX path must end in a slash (/).
Option 3 is probably best if only a handful of users require access to Tesseract; however, it may become awkward on systems with many users. Although option 3 is preferable to option 2 from a security point of view, option 2 may be the better choice as a practical matter.
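Option 3 boils down to two commands per user. In this sketch, mkdir stands in for unpacking the language tarball (which is what actually creates ~/tessdata); to make the setting permanent, you would add the export line to a shell startup file such as ~/.bashrc.

```shell
# Per-user language-file setup (option 3).
mkdir -p "$HOME/tessdata"          # normally created by unpacking
                                   # tesseract-2.00.eng.tar.gz in $HOME
export TESSDATA_PREFIX="$HOME/"    # Tesseract requires the trailing slash
```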
Tesseract’s use is fairly straightforward: Pass it the filename of a TIFF input file and an output filename (minus the .txt filename extension):
tesseract sample.tif sample
This command should create an output file called sample.txt containing the Unicode text conversion of sample.tif. If you’ve installed multiple language files or if Tesseract is trying to use a language file you didn’t install, you can pass a language code using the -l option, as in -l eng to force Tesseract to read the file as English.
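For a multi-page document, the same invocation extends naturally to a loop. This is a hypothetical batch run: the page01.tif naming scheme is an assumption, and it presumes Tesseract and the English data file are installed.

```shell
# Convert every page scan in the current directory, then join the results.
for f in page*.tif; do
    [ -e "$f" ] || continue            # skip cleanly if no scans are present
    tesseract "$f" "${f%.tif}" -l eng  # writes page01.txt, page02.txt, ...
done
```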
Depending on compilation options and your installed libraries (including development libraries), Tesseract might or might not accept compressed TIFF files. Thus, you might need to convert more advanced TIFF files to simple uncompressed format. Likewise, you’ll need to convert non-TIFF files into TIFF format.
On the output side, Tesseract generates Unicode files. These files resemble plain-text ASCII files, but they include additional non-ASCII characters. Tesseract makes liberal use of these characters for certain symbols that are common in many texts, such as em-dashes and “curly” quote marks. If you try to view Tesseract’s output using simple Linux utilities such as less, you may find that the text is difficult to read because of these Unicode characters, which simple Linux tools may not handle very well. A Unicode-capable editor, such as Yudit (http://www.yudit.org), will enable you to read your text, with Unicode characters displayed correctly.
Tips for Successful OCR Use
Although OCR programs often produce readable text files, they do have their limitations. Bad scans, unusual fonts in the original documents, complex layouts, and other issues can complicate scanning. Some issues, such as the font used in the original document, you can’t control. Others, though, you can influence. A few procedures and additional programs can help you obtain good OCR results.
Before you begin scanning a document, examine the original. If possible, find the best copy of the original — for instance, if you have a choice between scanning an original document produced by a laser printer or a photocopy of that original, chances are you’ll get better results with the laser printed original. If your document is wrinkled or stained, do whatever you can to clean it up. Ironing a wrinkled document or pressing it between heavy objects to flatten it may help produce a clean scan. If somebody’s scribbled notes on the original, consider erasing them or using white-out to remove them — or plan on removing these notes from the scan itself using the GIMP.
Before starting a scan, examine your scanner’s glass. Cleaning it may help produce a clear scan, free of stray marks caused by dust or smudges. When performing the scan, align the document as well as you can. If you’re scanning a book or magazine, press it as flat as possible against the scanner’s glass. When scanning individual sheets, shut the lid of the scanner. This will both eliminate dark borders that might confuse the OCR software and help keep the document flat.
Most OCR software works best with scans of a particular size, which typically translates into a scanning resolution of about 300 dots per inch (dpi), so you should set your scanning software to that resolution. The color setting may impact OCR quality. Although many OCR programs, including Tesseract, claim to work with two-color, gray-scale, and color scans, one or another of these settings is likely to work best. You may need to experiment to learn what works well with the software you’re using and with the types of documents you scan. My own tests show that Tesseract works well with 300 dpi scans, but very poorly with 100 dpi or 1200 dpi scans. Color settings don’t seem to greatly impact Tesseract.
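If your scanner is supported by SANE, the settings recommended above can be applied from the command line with scanimage. The exact mode names vary by backend, so treat this as a sketch; the guard skips the scan when no device is available.

```shell
# Scan one page at 300 dpi in gray-scale, producing a PNM file.
if command -v scanimage >/dev/null && scanimage -L 2>/dev/null | grep -q device; then
    scanimage --resolution 300 --mode Gray > page.pnm
fi
```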
Whatever your color setting and resolution, you should adjust the contrast and brightness so that the background is as near to pure white as you can make it while keeping the text black and legible. Random noise or “bleed-through” of text from the back side of a page can interfere with successful OCR operation, as can text that’s too faint. When scanning books, the edge near the spine can sometimes disappear into darkness if contrast and brightness aren’t set correctly. Crop your scan area to eliminate dark borders and graphics, or remove those elements after scanning using the GIMP.
You may find it necessary to convert your scanned pages from one format to another, or even from one TIFF variant to another — say, to remove compression from TIFF images. You can do this in the GIMP or using text-mode tools, such as pnmtotiff (part of the netpbm package) or tiff2bw (part of the tiff package). These command-line tools make conversion from one graphics format to another easy; most take an input filename and either take an output filename as a second argument or send output to standard output. Try typing the original file’s format (such as png) and then hitting the Tab key to see what conversion utilities are available, then read the relevant man pages to learn how to use the tools.
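As one example of such a pipeline, the netpbm tools can turn a PNG scan into the uncompressed TIFF Tesseract prefers. The filenames here are hypothetical, and the guard makes this a no-op when netpbm or the input file is absent.

```shell
# PNG -> uncompressed TIFF via netpbm; -none disables TIFF compression.
if command -v pngtopnm >/dev/null && [ -f scan.png ]; then
    pngtopnm scan.png | pnmtotiff -none > scan.tif
fi
```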
If you’re dealing with documents that have already been scanned but that are somewhat skewed in orientation, a tool called Unpaper (http://freshmeat.net/projects/unpaper/) may be useful. It’s intended to process scanned files to make them more likely to be easily read, by either a human or an OCR package. It removes dark edges and rotates the text to minimize skew.
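Unpaper’s basic invocation takes an input and an output file in a PNM-family format. This is a minimal sketch with assumed filenames, guarded so it does nothing if the tool or the scan is missing.

```shell
# Deskew and clean a scanned page before feeding it to OCR software.
if command -v unpaper >/dev/null && [ -f scan.pgm ]; then
    unpaper scan.pgm cleaned.pgm
fi
```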
The Future of OCR
With these pointers in hand, you should be able to make effective use of Linux OCR software. This is still an area in which Linux solutions are less slick than those for other platforms, though. Fortunately, work is underway to improve Linux OCR tools. The Tesseract engine is under active development and is being adopted by projects such as Ocropus, which aim to present a better user interface or perform pre-OCR processing on raw scans. If OCR capabilities are important to you, you should keep your eyes open for future developments.