OCR for Linux: Teaching Linux to Read

Rod Smith covers the optical character recognition (OCR) options for Linux, their limitations, and how to install and use Tesseract for your OCR needs on Linux.

Computers are excellent number-crunching machines, but they’ve traditionally been very poor at dealing with the “fuzzier” everyday world at which humans excel. Ask a computer to add a thousand numbers and it wouldn’t blink an eye if it had one; however, ask a computer to read those thousand numbers from a sheet of paper and you’ll run into problems. Even with a scanner attached, a computer will have a hard time recognizing printed numbers (or, generalizing a bit, letters and punctuation) for what they are, a task that even kindergarten children can master.

The software that attempts to teach computers about the printed alphabet and words is known as Optical Character Recognition (OCR) software. This broad class of software accepts as input a graphics file containing a scan of a printed page and outputs a text file. The input file format can, in principle, be anything: PNM, TIFF, or some other format.

In some cases the OCR software can use a scanner directly, bypassing the need to store a file on disk. Similarly, the output file format can be just about anything — a plain-text ASCII file, a word processor document, a PDF, or what have you. In any case, the challenge is the same: Giving the computer the ability to recognize individual letters. This, rather than the mundane challenge of reading or writing particular computer file formats, is the challenge of OCR.

OCR Software Capabilities and Limitations

OCR can be very tough; for instance, the difference between a lower-case L (l) and a digit one (1) is very small. Scans are seldom perfect, which complicates the comparison; even the same letter will consist of a different bit pattern at different points on the page. Add to the challenge by requiring the software to recognize text in a variety of fonts and you can begin to see why this task, which is so easy for humans, is a difficult one for computers to master.

The difficulty in correctly identifying individual characters or groups of characters is largely a technical challenge, and one that OCR developers are working to resolve. In fact, the best of today’s OCR software does reasonably well in this respect; the better software can achieve over 95% accuracy, at least when it’s not challenged by poor scans, bizarre fonts, or otherwise substandard input. Of course, even a 5% error rate (1 in 20 characters) can be unacceptable for some purposes; for instance, consider that this equates to four errors per line in an 80-character-per-line text. Still, it’s probably easier to proofread a text and change 5% of its characters than to manually re-type the document.
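The arithmetic behind that figure is easy to check in the shell. This is only a sketch; the variable names are mine, chosen for illustration:

```shell
# Expected misread characters per line at a given error rate.
chars_per_line=80    # line width used in the text
error_percent=5      # 100% minus the 95% accuracy figure

# Integer arithmetic is enough here: 80 * 5 / 100
errors_per_line=$(( chars_per_line * error_percent / 100 ))
echo "$errors_per_line errors per line"
```

Changing error_percent to match the accuracy you actually observe gives a rough idea of how much proofreading to expect.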

OCR software often has limitations other than the raw accuracy rate, though. For instance, one of the better Linux programs, Tesseract, accepts only TIFF files as input (and, depending on how it was built, perhaps only uncompressed ones) and generates only Unicode text files as output. Tesseract also doesn’t recognize multi-column formatting or embedded figures, so it outputs correctly recognized but misordered text from multi-column scans and gibberish in place of embedded graphics. To use a program such as Tesseract on such scans, you must manually edit the scanned images to reduce them to single columns of text. Combined with less-than-ideal accuracy rates, that extra step may make OCR more work than it’s worth.

Some programs are less limited in handling advanced formatting. Although some still require you to set options manually to tell the software how many columns exist or where graphics are located on the page, they can at least handle these features. Unfortunately, to the best of my knowledge, the Linux OCR packages with the best character recognition are deficient at handling formatting, so you have to make a choice: high accuracy or easy OCR from complex layouts.

An Overview of Linux OCR Options

What are the Linux OCR options, though? A quick search of the Web reveals several. The better, or higher-profile, options for Linux include:

  • Tesseract (http://code.google.com/p/tesseract-ocr/) — This program, as just noted, is among the most accurate available for Linux. It is, however, hampered by limited input and output options. If you need to scan plain text with relatively straightforward formatting, though, it’s an excellent option.

  • Ocropus (http://code.google.com/p/ocropus/) — This package is built on top of Tesseract and provides more advanced layout handling and other features that Tesseract lacks. It’s currently in very early stages of development, though, so you should only try it if you want to live on the “bleeding edge” of the Linux OCR world.

  • gscan2pdf (http://gscan2pdf.sourceforge.net) — This program’s claim to fame is that it scans directly from a scanner and outputs directly to a PDF. This combination is useful if you happen to want to produce a PDF from a printed document you have in hand. However, this limits your proofreading and correction options.

  • GNU Ocrad (http://www.gnu.org/software/ocrad/ocrad.html) — This program is similar to Tesseract in overall capabilities, although it takes PBM, PGM, or PPM files rather than TIFF files and it includes an analyzer to handle multi-column layouts. In various tests, it hasn’t fared as well as Tesseract in overall accuracy.

  • GOCR (http://jocr.sourceforge.net) — GOCR is an OCR engine that’s intended to be used by other programs. It can be used as a stand-alone OCR program, though.

  • Clara OCR (http://freshmeat.net/projects/claraocr/) — This OCR package is intended for use in distributed OCR of old books. It includes both standalone and Web-based tools to aid in this goal.

  • VueScan (http://www.hamrick.com) — This shareware program is primarily a tool for scanning photos (including both prints and transparencies), but it incorporates a good OCR engine as well. In my informal tests, VueScan’s OCR seemed about as accurate as Tesseract.

This list is not comprehensive; a simple Web search will reveal many other Linux OCR packages. With so many options, I can’t cover them all, so I’ve chosen to describe how to use Tesseract.

Tesseract Installation

I’ve chosen to describe Tesseract in greater depth because it’s currently the best open source Linux OCR software in terms of accuracy. Its layout and graphics support are limited, but if you need OCR to convert simple text documents into plain text, Tesseract is, to the best of my knowledge, the best tool for the job.

Although Tesseract is usable, it’s not yet included in most distributions. So, chances are you’ll need to download it from its Google Code page. Be sure to obtain the main Tesseract source code file (tesseract-2.01.tar.gz as I write) and the file(s) for whatever language(s) you intend to scan (such as tesseract-2.00.eng.tar.gz for English). As of Tesseract 2.01, eight languages are supported: Dutch, English, Fraktur (Old German), French, German, Italian, Portuguese, and Spanish. Tesseract ships with tools to assist in the creation of support files for other languages, but creating these files will require substantial additional work.

Compilation and installation of the main Tesseract package is fairly conventional:

1. Unpack the source code.

2. Enter the main source code directory.

3. Type ./configure to configure the software.

4. Type make to compile the software.

5. As root, type make install to install the software.
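Put together, the steps above might look like the following sketch. It assumes the tarball sits in the current directory and unpacks into a directory named after it (tesseract-2.01), which is the usual convention but an assumption on my part. It also uses su to become root, where your system might use sudo instead.

```shell
tarball=tesseract-2.01.tar.gz
srcdir=${tarball%.tar.gz}        # assumed unpack directory: tesseract-2.01

if [ -f "$tarball" ]; then
    # Step 1: unpack the source code
    tar xzf "$tarball"
    (
        # Step 2: enter the main source code directory
        cd "$srcdir" || exit 1
        # Steps 3-5: configure, compile, then install as root
        ./configure && make && su -c 'make install'
    )
else
    echo "$tarball not found; download it first" >&2
fi
```

The subshell (the parentheses) keeps the cd from changing your working directory once the build finishes.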

This procedure will install Tesseract, but not its language-specific data files, which you must download separately from Tesseract itself, as noted earlier. Unfortunately, Tesseract is a bit peculiar in that it requires read/write access to its language data files. This leaves you with three options for how to proceed with installation and use of Tesseract:

1. You may install the language files in a system directory (such as /usr/local/share/tessdata) and run Tesseract only as root. This option has obvious negative security implications, so I don’t recommend it.

2. You may install the language files in a system directory, as in option 1, but modify the permissions on the directory and its files so that one or more users may write to the files.

3. All Tesseract users may install copies of the language files in their own home directories. If users unpack the tesseract-2.00.eng.tar.gz file in their home directories, the result will be a ~/tessdata subdirectory in each home directory. Users may then set the TESSDATA_PREFIX environment variable to point to their home directories; Tesseract will look in the tessdata subdirectory of the directory specified by TESSDATA_PREFIX to locate language files. Note that the TESSDATA_PREFIX path must end in a slash (/).

Option 3 is probably best if only a handful of users require access to Tesseract; however, it may become awkward on systems with many users. On such systems, although option 3 is preferable to option 2 from a security point of view, option 2 may be the better choice as a practical matter.
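For the home-directory approach, the per-user setup might look like this sketch. It assumes the English language tarball has been downloaded to the current directory. The last two lines could go in a shell startup file such as ~/.profile, though that placement is my suggestion rather than a Tesseract requirement.

```shell
# Unpack the language files into ~/tessdata (skipped if the tarball is absent).
langfile=tesseract-2.00.eng.tar.gz
if [ -f "$langfile" ]; then
    tar xzf "$langfile" -C "$HOME"
fi

# Point Tesseract at the directory that contains tessdata.
# The trailing slash is required.
TESSDATA_PREFIX="$HOME/"
export TESSDATA_PREFIX
```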

Using Tesseract

Tesseract’s use is fairly straightforward: Pass it the filename of a TIFF input file and an output filename (minus the .txt filename extension):

tesseract sample.tif sample

This command should create an output file called sample.txt containing the Unicode text conversion of sample.tif. If you’ve installed multiple language files or if Tesseract is trying to use a language file you didn’t install, you can pass a language code using the -l option, as in -l eng to force Tesseract to read the file as English.
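If you have a stack of scanned pages, a small shell function can apply the same command to each one. This is a sketch: ocr_all is a name I made up, and it assumes the pages are stored as .tif files in a single directory.

```shell
# ocr_all DIR [LANG]: run Tesseract on every .tif file in DIR.
# LANG defaults to eng if omitted.
ocr_all() {
    dir=$1
    lang=${2:-eng}
    for tif in "$dir"/*.tif; do
        # Skip the unexpanded pattern when no .tif files exist.
        [ -e "$tif" ] || continue
        # The output name omits .txt; Tesseract appends it.
        tesseract "$tif" "${tif%.tif}" -l "$lang"
    done
}
```

Calling ocr_all scans eng, where scans is a hypothetical directory name, would leave a .txt file next to each TIFF in that directory.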

Depending on compilation options and your installed libraries (including development libraries), Tesseract might or might not accept compressed TIFF files. Thus, you might need to convert more advanced TIFF files to simple uncompressed format. Likewise, you’ll need to convert non-TIFF files into TIFF format.

On the output side, Tesseract generates Unicode files. These files resemble plain-text ASCII files, but they include additional non-ASCII characters. Tesseract makes liberal use of these characters for certain symbols that are common in many texts, such as em-dashes and “curly” quote marks. If you try to view Tesseract’s output using simple Linux utilities such as cat or less, you may find that the text is difficult to read because of these Unicode characters, which simple Linux tools may not handle very well. A Unicode-capable editor, such as Yudit (http://www.yudit.org), will enable you to read your text, with Unicode characters displayed correctly.
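One quick way to find out whether a given output file contains such characters is to count its non-ASCII bytes. This sketch uses only standard tools; the function name count_non_ascii is hypothetical, and the check is my own suggestion rather than anything Tesseract provides.

```shell
# count_non_ascii FILE: print the number of bytes outside the ASCII range.
count_non_ascii() {
    # Delete every ASCII byte (octal 000-177), then count what is left.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

A result of zero means the file is plain ASCII and should display properly in any tool. Anything higher means at least one Unicode character is present; note that a single such character usually occupies several bytes.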

Tips for Successful OCR Use

Although OCR programs often produce readable text files, they do have their limitations. Bad scans, unusual fonts in the original documents, complex layouts, and other issues can complicate scanning. Some issues, such as the font used in the original document, you can’t control. Others, though, you can influence. A few procedures and additional programs can help you obtain good OCR results.

Before you begin scanning a document, examine the original. If possible, find the best copy of the original — for instance, if you have a choice between scanning an original document produced by a laser printer or a photocopy of that original, chances are you’ll get better results with the laser printed original. If your document is wrinkled or stained, do whatever you can to clean it up. Ironing a wrinkled document or pressing it between heavy objects to flatten it may help produce a clean scan. If somebody’s scribbled notes on the original, consider erasing them or using white-out to remove them — or plan on removing these notes from the scan itself using the GIMP.

Before starting a scan, examine your scanner’s glass. Cleaning it may help produce a clear scan, free of stray marks caused by dust or smudges. When performing the scan, align the document as well as you can. If you’re scanning a book or magazine, press it as flat as possible against the scanner’s glass. When scanning individual sheets, shut the lid of the scanner. This will both eliminate dark borders that might confuse the OCR software and help keep the document flat.

Most OCR software works best with scans of a particular size, which typically translates into a scanning resolution of about 300 dots per inch (dpi), so you should set your scanning software to that resolution. The color setting may impact OCR quality. Although many OCR programs, including Tesseract, claim to work with two-color, gray-scale, and color scans, one or another of these settings is likely to work best. You may need to experiment to learn what works well with the software you’re using and with the types of documents you scan. My own tests show that Tesseract works well with 300 dpi scans, but very poorly with 100 dpi or 1200 dpi scans. Color settings don’t seem to greatly impact Tesseract.
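If you scan from the command line, the resolution can be set there as well. The sketch below uses scanimage from the SANE project, a tool this article doesn’t otherwise cover. The --resolution and --mode options are supplied by the scanner backend, so their availability and accepted values (Gray here) are assumptions that may not hold for your hardware. The function name scan_page is hypothetical.

```shell
# scan_page OUTFILE: scan one page at 300 dpi in grayscale to a TIFF file.
scan_page() {
    # scanimage writes the image to standard output.
    scanimage --resolution 300 --mode Gray --format=tiff > "$1"
}
```

Running scanimage --help with your scanner attached lists the options its backend actually supports.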

Whatever your color setting and resolution, you should adjust the contrast and brightness so that the background is as near to pure white as you can make it while keeping the text black and legible. Random noise or “bleed-through” of text from the back side of a page can interfere with successful OCR operation, as can text that’s too faint. When scanning books, the edge near the spine can sometimes disappear into darkness if contrast and brightness aren’t set correctly. Crop your scan area to eliminate dark borders and graphics, or remove those elements after scanning using the GIMP.

You may find it necessary to convert your scanned pages from one format to another, or even from one TIFF variant to another — say, to remove compression from TIFF images. You can do this in the GIMP or using text-mode tools, such as pnmtotiff (part of the netpbm package) or tiff2bw (part of the tiff package). These command-line tools make conversion from one graphics format to another easy; most take an input filename and either take an output filename as a second argument or send output to standard output. Try typing the original file’s format (such as tiff or png) and then hitting the Tab key to see what conversion utilities are available, then read the relevant man pages to learn how to use the tools.
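As a sketch of such a conversion, the function below produces an uncompressed TIFF from either a TIFF or a PNG. It assumes tiffcp (from the same tiff package as tiff2bw) and pngtopnm (from netpbm) are installed; the function name to_plain_tiff is hypothetical.

```shell
# to_plain_tiff INPUT OUTPUT: write an uncompressed TIFF copy of INPUT.
to_plain_tiff() {
    in=$1
    out=$2
    case "$in" in
        *.tif|*.tiff)
            # Re-save the TIFF with compression turned off.
            tiffcp -c none "$in" "$out"
            ;;
        *.png)
            # Convert PNG to PNM, then PNM to uncompressed TIFF.
            pngtopnm "$in" | pnmtotiff -none > "$out"
            ;;
        *)
            echo "unsupported input: $in" >&2
            return 1
            ;;
    esac
}
```

For example, to_plain_tiff page.png page.tif would create a file suitable for passing to Tesseract.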

If you’re dealing with documents that have already been scanned but that are somewhat skewed in orientation, a tool called Unpaper (http://freshmeat.net/projects/unpaper/) may be useful. It’s intended to process scanned files to make them more likely to be easily read, by either a human or an OCR package. It removes dark edges and rotates the text to minimize skew.

The Future of OCR

With these pointers in hand, you should be able to make effective use of Linux OCR software. This is still an area in which Linux solutions are less slick than those for other platforms, though. Fortunately, work is underway to improve Linux OCR tools. The Tesseract engine is under active development and is being adopted by projects, such as Ocropus, which aim to present a better user interface or perform pre-OCR processing on raw scans. If OCR capabilities are important to you, you should keep your eyes open for future developments.
