Tesseract-ocr is now at Google

by Dipin Krishna
July 17, 2009

Tesseract-ocr is an OCR engine that was developed at HP Labs between 1985 and 1995 and is now at Google.
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The engine reads a binary, grey or color image and outputs text. A TIFF reader is built in that reads uncompressed TIFF images, or libtiff can be added to read compressed images.
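
To make the image-in, text-out flow concrete, here is a minimal sketch of driving the engine from Python. It assumes the tesseract command-line program is installed and on PATH, that it is invoked as "tesseract <image> <outputbase> -l <lang>", and that the language data for the chosen language has been installed separately. The helper names build_tesseract_command and run_ocr are made up for this illustration and are not part of Tesseract itself.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognized text.

    Assumption: the `tesseract` executable is on PATH and the language
    data for `lang` has been installed separately from the code.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # tesseract appends ".txt" to the output base name it is given
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

With a scanned page saved as page.tif (a hypothetical file name), calling run_ocr("page.tif", "page") would write page.txt next to it and return its contents. Building the command in its own function keeps the argument order easy to check without running the engine.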

Important Download Information:

The language data files are separate from the code!

See the ReadMe wiki for installation and usage information!

Additional installation and usage information can be found in the FAQ wiki.

Supported Platforms

The developers are regularly testing on the following platforms:

Ubuntu 6.06 (x86/32, x86/64)
Ubuntu 6.10 (x86/32, x86/64)
Windows (x86/32) with Visual C++ Express 2008

The upcoming 3.00 release will probably include:

Page layout analysis.
Automatic page orientation and script detection capability.
Special modes for single column, line, word and even character.
Improved API ready for thread-safety.
Many more languages, including Chinese.

Please check out the ReadMe before going to Downloads as you need more than one file.

Original document: http://code.google.com/p/tesseract-ocr/
