On Mon, 2009-11-02 at 08:31 -0700, Matt Graham wrote:
> I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
> Journal, Christian Science Monitor, and Boston Globe. You will NOT
> be able to get decent OCR with free software. Newspapers require
> a different approach than most OCR packages take; you have to split
> each article up into multiple individual image files and OCR each
> file separately, then stitch the results back together. And editing
> the results is totally necessary since newspaper text is so horrible
> in quality.
----
I don't know anything about GOCR at all.

A few years ago I set up Tesseract, and it worked as well as any OCR
program I have seen (in terms of accuracy), though it clearly has many
limitations compared to something like OmniPage. In the end it was
rather easy to install and get working.

http://code.google.com/p/tesseract-ocr/

At one time I considered gluing it into something like Alfresco for
document management, but that seemed difficult; at this point I would
probably just write a wrapper program in Ruby.

Depending upon what the OP is looking for in document management,
Alfresco might just be the ticket.

http://www.alfresco.com/

Craig

---------------------------------------------------
PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
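
For what it's worth, the split / OCR / stitch workflow Matt describes, combined with the wrapper-program idea, could be sketched roughly as below. This is only an illustration, written in Python rather than Ruby, and it rests on several assumptions: the article has already been cut into separate image files by some other step, the `tesseract` binary is on the PATH, the image format is one your Tesseract build accepts, and the basic `tesseract IMAGE OUTPUTBASE -l LANG` invocation is used. The function names (`tesseract_command`, `run_tesseract`, `stitch`, `ocr_article`) are made up for this example.

```python
import subprocess
import tempfile
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argv for one run: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_tesseract(image_path, lang="eng"):
    """OCR a single image file and return the recognized text.

    Assumes the tesseract binary is installed and on the PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "out"
        subprocess.run(
            tesseract_command(image_path, output_base, lang),
            check=True,
            capture_output=True,
        )
        # tesseract appends ".txt" to the output base it is given.
        return (Path(tmp) / "out.txt").read_text(encoding="utf-8")


def stitch(pieces):
    """Join per-piece OCR text in reading order, dropping empty pieces."""
    cleaned = [piece.strip() for piece in pieces]
    return "\n\n".join(piece for piece in cleaned if piece)


def ocr_article(piece_paths, ocr=run_tesseract):
    """OCR each pre-cut piece of an article separately, then stitch the text.

    The ocr argument is swappable so another engine (or a stub) can be used.
    """
    return stitch(ocr(path) for path in piece_paths)
```

With hypothetical file names, usage would be something like `ocr_article(["article7_col1.tif", "article7_col2.tif"])`, passing the pieces in reading order. None of this removes the need to hand-edit the result afterwards; it only automates the per-piece runs and the stitching.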