Monday, October 02, 2006

Google's Tesseract OCR engine is a quantum leap forward

The open source optical character recognition (OCR) landscape improved dramatically when Google recently released the Tesseract OCR engine as open source software.

The Tesseract code was written at Hewlett-Packard in the 1980s and '90s. In 1995, it was one of the top-tier performers at UNLV's OCR competition, but when HP withdrew from the OCR software marketplace, the code languished. Then in 2005, HP handed off the code to UNLV's Information Science Research Institute (ISRI), an academic center doing ongoing research into OCR and related topics. ISRI discovered that original Tesseract developer Ray Smith was now an employee at Google, and asked the search engine giant if it was interested in the code. Google spent a few months updating the code to compile on modern operating systems, and released it on SourceForge.net.

You can download the latest tarball, a bugfix release numbered 1.0.1, from the Tesseract OCR project page. The only compilation instructions are those in the release notes section of the SourceForge.net download page, which cover Windows, Mac OS X, and Linux, all for the same source code. Compilation under Linux is straightforward: run ./configure followed by make. There is no make install step, however. Instead, you must move the resulting tesseract binary into its parent directory, where it expects to find a support directory called tessdata. Make sure the directory is writable, because Tesseract generates temporary files there while processing an image.
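The post-build step described above can be sketched as a small shell function. This is only an illustration, not part of the Tesseract distribution: the function name stage_tesseract is made up here, and the location of the built binary is passed in as an argument because the exact subdirectory is not stated above. Since the text does not say precisely which directory must be writable, the sketch conservatively checks both the parent directory and tessdata.

```shell
#!/bin/sh
# Sketch only: stage_tesseract is a hypothetical helper, not a Tesseract tool.
# Argument: path to the tesseract binary that make produced (an assumption;
# pass whatever path your build actually left it in).

stage_tesseract() {
    bin_path=$1
    bin_dir=$(dirname "$bin_path")       # directory the binary was built in
    parent_dir=$(dirname "$bin_dir")     # its parent, where tessdata should live

    # Refuse to continue if the expected layout is not there.
    [ -f "$bin_path" ] || { echo "no binary at $bin_path" >&2; return 1; }
    [ -d "$parent_dir/tessdata" ] || { echo "no tessdata in $parent_dir" >&2; return 1; }

    # Move the binary up one level so it sits next to tessdata.
    mv "$bin_path" "$parent_dir/" || return 1

    # Tesseract writes temporary files while processing an image, so make
    # sure the current user can write to both candidate directories.
    for dir in "$parent_dir" "$parent_dir/tessdata"; do
        [ -w "$dir" ] || chmod u+w "$dir" || return 1
    done
}
```

After running ./configure and make in the unpacked source tree, you would call the function with the path to the freshly built binary, then run tesseract from the parent directory.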


--
. . h.o.s.a.m.r.e.d  . .
Unleashed Innovation
--
http://hosamred.blogspot.com
