On Nov 2, 2009, at 11:32 AM, Craig White wrote:

> On Mon, 2009-11-02 at 08:31 -0700, Matt Graham wrote:
>> I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
>> Journal, Christian Science Monitor, and Boston Globe. You will NOT
>> be able to get decent OCR with free software. Newspapers require
>> a different approach than most OCR packages take; you have to split
>> each article up into multiple individual image files and OCR each
>> file separately, then stitch the results back together. And editing
>> the results is totally necessary since newspaper text is so horrible
>> in quality.
> ----
> I don't know anything about GOCR at all.
>
> A few years ago I set up tesseract and it worked as well as I have seen
> any OCR program work (in terms of accuracy) though clearly there are
> many limitations compared to something like Omnipage. In the end it was
> rather easy to install and get it working.
>
> http://code.google.com/p/tesseract-ocr/

Google uses tesseract in their ocropus project. Ocropus seems promising,
but is still at a fairly early stage.

http://code.google.com/p/ocropus/

alex
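
P.S. For what it's worth, the split/OCR/stitch workflow Matt describes could be sketched roughly as below. This is only an illustration, not anyone's actual pipeline: it assumes the article has already been cropped into separate image files, listed in reading order, and that the tesseract command-line tool is on the PATH. The function names and the file names in the usage note are made up for the example.

```python
import subprocess
import tempfile
from pathlib import Path


def run_tesseract(image_path):
    """OCR one image by calling the tesseract command-line tool."""
    with tempfile.TemporaryDirectory() as tmp:
        out_base = Path(tmp) / "out"
        # CLI form: tesseract <image> <output base>; the text is written to <output base>.txt
        subprocess.run(["tesseract", str(image_path), str(out_base)], check=True)
        return out_base.with_suffix(".txt").read_text(encoding="utf-8")


def ocr_article(piece_paths, ocr=run_tesseract):
    """OCR each piece separately, then stitch the text together in the given order."""
    texts = [ocr(path).strip() for path in piece_paths]
    # Skip pieces that produced no text so the output has no empty paragraphs.
    return "\n\n".join(t for t in texts if t)
```

Something like `ocr_article(["col1.png", "col2.png", "col3.png"])` would then return one string for the whole article. The `ocr` parameter is there so a different engine can be swapped in, and the result would still need the hand editing Matt mentions.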