On Mon, 2009-11-02 at 08:31 -0700, Matt Graham wrote:
> From: Alex Dean
> > On Nov 1, 2009, at 9:24 PM, Ted Gould wrote:
> >> I'd recommend gscan2pdf.  It works with SANE, but does nice things
> >> like handle double sided stuff easily.  It will also work with
> >> GOCR to do OCR
>
> That's not exactly a great thing.  GOCR is much worse than commercial
> OCR engines, especially if the original image is skewed/broken.

Oh, well, it's pluggable; GOCR is just all I have :)  In general, it
does a bunch of cleanup to make the OCR reasonable.  It seems to pull
most of the keywords out of things like credit card statements, so I
would say it's good enough for search, but it's not perfect by any
stretch of the imagination.

> > Someday I'm going to start digitizing and OCR-ing the 100 years of
> > local newspapers which are gathering mold in the library basement.  I
> > really have no firm plan as to how I'm going to do it, but doing it
> > with free software would be a big plus.
>
> I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
> Journal, Christian Science Monitor, and Boston Globe.  You will NOT
> be able to get decent OCR with free software.  Newspapers require
> a different approach than most OCR packages take; you have to split
> each article up into multiple individual image files and OCR each
> file separately, then stitch the results back together.  And editing
> the results is totally necessary since newspaper text is so horrible
> in quality.
>
> (I can talk about this for at least half an hour; contact offlist
> for more info.)

+1, I wouldn't use it for archival things like that yet.  But you
might be able to use GOCR with the work Google is doing -- I'm not
sure if they're open sourcing all of it or not.

		--Ted
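
The split / OCR / stitch workflow Matt describes could be sketched roughly as below. This is only an illustration under stated assumptions, not his actual pipeline: the helper names `gocr_text` and `ocr_article` are made up here, the page is assumed to have been cut into per-article region images already (in reading order), and the default engine simply shells out to the `gocr` command with `-i`, so `gocr` has to be installed and on PATH. Since the thread notes the OCR step is pluggable, the engine is passed in as a function and can be swapped for something else.

```python
import subprocess
from typing import Callable, Iterable


def gocr_text(image_path: str) -> str:
    """OCR one image by shelling out to the gocr command-line tool.

    Assumes gocr is installed and on PATH, and that the image is in a
    format gocr can read (PNM is the safe choice).
    """
    result = subprocess.run(
        ["gocr", "-i", image_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def ocr_article(region_paths: Iterable[str],
                ocr: Callable[[str], str] = gocr_text) -> str:
    """OCR each region image separately, then stitch the text together.

    region_paths must already be in reading order; cutting the page
    scan into per-article regions is assumed to happen beforehand.
    """
    pieces = []
    for path in region_paths:
        text = ocr(path).strip()
        if text:  # skip regions where the engine found nothing
            pieces.append(text)
    # Separate regions with a blank line so a human editor can see the seams.
    return "\n\n".join(pieces)
```

The stitched text would still need the manual editing pass mentioned above; the blank line between regions is only there to make that pass easier.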