How many of us have tried to get a hold of reliable optical character recognition (OCR) software, only to be stumped by buggy and incomplete code in the free or cheap tools, and ridiculous pricing on commercial packages? If you've been bitten by the digitization bug—like me—your days of frustration may be over, and once again it's Google to the rescue.
Tesseract is not only a cooler name for a hypercube, but also an OCR engine originally written by Hewlett-Packard engineers. After ten years of development, it held its own in a comprehensive comparison test (PDF) of OCR solutions back in 1995. In that test, it ran neck-and-neck with the best option available at the time, the precursor to the product that passed from Caere to ScanSoft and is now Nuance OmniPage. Unfortunately, HP gave up on the OCR market that year, and Tesseract was shelved for ten years.
Last year, HP and UNLV figured that the package would probably do more good in the wild than on a warehouse shelf, and released the code to the public. But despite its once-respectable performance, it was very buggy and clearly not designed to run on modern hardware. Google stepped in, took a look, and sicced a few engineers on fixing up the most obvious bugs. In June this year, the company figured that Tesseract was stable enough for public release, and so it was rereleased as a SourceForge project under the Apache license, version 2.0.