The aim of this project is to add Indic script support to the Tesseract OCR engine, which currently does not support connected scripts such as Devanagari. This involves adding routines to the existing code base, training the engine with sample images, and then testing its accuracy to guide subsequent debugging and refinement of the algorithms.
Tools and software used
Tesseract OCR engine 2.03 http://code.google.com/p/tesseract-ocr/
Gimp 2.2.17 http://www.gimp.org/
bbtesseract (GUI for editing training data, such as box files) 0.5.34 http://code.google.com/p/bbtesseract/
Project Plan: Take the input image and preprocess it so that it is fit to be processed by the Tesseract OCR engine. For Devanagari script, this means clipping the maatra (shironaam), the headline stroke also known as the shirorekha, between successive characters.
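The clipping step can be illustrated with a minimal sketch. It is not the project's actual routine: the function name clip_headline, the frac threshold, and the representation of the image as a list of rows of 0/1 pixels (1 = ink) are all assumptions made for illustration. The sketch locates the headline as the band of rows with the most ink, then erases the headline in every column that has no ink outside that band, on the assumption that such columns are gaps between characters.

```python
def clip_headline(img, frac=0.8):
    """Return a copy of a binary image with the headline cut between characters.

    img  : list of equal-length rows, each a list of 0/1 pixels (1 = ink).
    frac : hypothetical threshold; rows with at least frac * peak ink pixels
           and adjacent to the densest row are treated as the headline band.
    """
    counts = [sum(row) for row in img]      # horizontal projection profile
    peak = max(counts, default=0)
    if peak == 0:                           # blank image: nothing to clip
        return [row[:] for row in img]

    # Grow a contiguous band of rows around the densest row.
    top = bottom = counts.index(peak)
    while top > 0 and counts[top - 1] >= frac * peak:
        top -= 1
    while bottom < len(img) - 1 and counts[bottom + 1] >= frac * peak:
        bottom += 1

    out = [row[:] for row in img]
    for x in range(len(img[0])):
        # A column with ink above or below the band belongs to a character.
        has_body = any(
            img[r][x] for r in range(len(img)) if r < top or r > bottom
        )
        if not has_body:
            # Assumed gap between characters: erase the headline here.
            for r in range(top, bottom + 1):
                out[r][x] = 0
    return out
```

Calling clip_headline on a binarized word image returns a new image and leaves the input unchanged. Real scans would need more care than this sketch takes, for example with skewed text or with thin strokes that touch the headline without extending below it.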