Tesseract-ocr is an OCR engine that was developed at HP Labs between 1985 and 1995… and is now at Google.
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably still one of the most accurate open source OCR engines available. The engine reads a binary, grey, or color image and outputs text. A built-in TIFF reader handles uncompressed TIFF images; libtiff can be added to read compressed images.
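As a rough illustration of the image-in, text-out flow described above, the following Python sketch wraps the command-line program. It assumes the usual invocation form `tesseract <image> <outputbase> [-l <lang>]`, with the recognized text written to `<outputbase>.txt`; the helper names `build_tesseract_command` and `run_ocr` are hypothetical and not part of the project. Check the ReadMe for the exact usage of your version.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract <image> <outputbase> [-l <lang>]."""
    command = ["tesseract", str(image_path), str(output_base)]
    if lang is not None:
        # The language code must match an installed language data file.
        command += ["-l", lang]
    return command


def run_ocr(image_path, output_base, lang=None):
    """Run the engine and return the recognized text.

    Assumes a `tesseract` executable is on PATH and that it writes its
    result to <output_base>.txt (an assumption, see the ReadMe).
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For example, `run_ocr("page.tif", "page", lang="eng")` would be expected to leave the text in `page.txt`, provided the English language data is installed. Without libtiff, the input should be an uncompressed TIFF.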
Important Download Information:
The language data files are separate from the code!
See the ReadMe wiki for installation and usage information!
Additional installation and usage information can be found in the FAQ wiki.
The developers are regularly testing on the following platforms:
- Ubuntu 6.06 (x86/32, x86/64)
- Ubuntu 6.10 (x86/32, x86/64)
- Windows (x86/32) with Visual C++ Express 2008
The upcoming 3.00 release will probably include:
- Page layout analysis.
- Automatic page orientation and script detection capability.
- Special modes for a single column, line, word, or even character.
- Improved API ready for thread-safety.
- Many more languages, including Chinese.
Please read the ReadMe before going to Downloads, as you need more than one file.
Original document: http://code.google.com/p/tesseract-ocr/