OCR for Linux: Teaching Linux to Read

Rod Smith covers the optical character recognition (OCR) options for Linux, their limitations, and how to install and use Tesseract for your OCR needs on Linux.

Computers are excellent number-crunching machines, but they've traditionally been very poor at dealing with the "fuzzier" everyday world in which humans excel. Ask a computer to add a thousand numbers and it wouldn't blink an eye if it had one; ask it to read those thousand numbers from a sheet of paper, however, and you'll run into problems. Even with a scanner attached, a computer has a hard time recognizing printed numbers (or, generalizing a bit, letters and punctuation) for what they are, a task that even kindergarten children can master.

The software that attempts to teach computers about the printed alphabet and words is known as Optical Character Recognition (OCR) software. This broad class of software accepts as input a graphics file containing a scan of a printed page and outputs a text file. The input file format can, in principle, be anything: PNM, TIFF, or some other format.

In some cases the OCR software can use a scanner directly, bypassing the need to store a file on disk. Similarly, the output file format can be just about anything: a plain-text ASCII file, a word processor document, a PDF, or what have you. In any case, the core problem is the same: giving the computer the ability to recognize individual letters. This, rather than the mundane task of reading or writing particular computer file formats, is the challenge of OCR.
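The image-in, text-out flow described above can be sketched as a thin Python wrapper around the Tesseract command-line program. This is only an illustration, assuming the `tesseract` executable is installed and on the PATH and is called in its basic form, `tesseract IMAGE OUTPUTBASE`, which writes the recognized text to `OUTPUTBASE.txt`. The function names `build_tesseract_command` and `ocr_to_text` are hypothetical helpers invented for this sketch, not part of Tesseract.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Return the argument list for the basic `tesseract IMAGE OUTPUTBASE` call.

    Hypothetical helper for this sketch; it only assembles the arguments.
    """
    return ["tesseract", str(image_path), str(output_base)]


def ocr_to_text(image_path, output_base):
    """Run Tesseract on a scanned page and return the recognized text.

    Assumes the `tesseract` executable is installed and on the PATH.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    # Tesseract appends ".txt" to the output base name for plain-text output.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

With a scanned page saved as `page.tif` (a placeholder file name), `ocr_to_text("page.tif", "page")` would run the recognition and read back `page.txt`. Which image formats are accepted depends on how the installed Tesseract was built.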

OCR Software Capabilities and Limitations

OCR can be very…
