On Mon, 2009-11-02 at 08:31 -0700, Matt Graham wrote:
> >From: Alex Dean <alex@crackpot.org>
> > On Nov 1, 2009, at 9:24 PM, Ted Gould wrote:
> >> I'd recommend gscan2pdf.  It works with SANE, but does nice things  
> >> like handle double sided stuff easily.  It will also work with
> >> GOCR to do OCR
> 
> That's not exactly a great thing.  GOCR is much worse than commercial
> OCR engines, especially if the original image is skewed/broken.

Oh, well, it's pluggable; GOCR is just all I have :)

In general, gscan2pdf does a bunch of cleanup to make the OCR reasonable.
I wouldn't say it's anywhere near perfect, but it seems to pull most of
the keywords out of things like credit card statements, so it's good
enough for search.

> > Someday I'm going to start digitizing and OCR-ing the 100 years of  
> > local newspapers which are gathering mold in the library basement.  I  
> > really have no firm plan as to how I'm going to do it, but doing it  
> > with free software would be a big plus.
> 
> I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
> Journal, Christian Science Monitor, and Boston Globe.  You will NOT
> be able to get decent OCR with free software.  Newspapers require
> a different approach than most OCR packages take; you have to split
> each article up into multiple individual image files and OCR each
> file separately, then stitch the results back together.  And editing
> the results is totally necessary since newspaper text is so horrible
> in quality.
> 
> (I can talk about this for at least half an hour; contact offlist
> for more info.)

+1, I wouldn't use it for archival things like that yet.  But you might
be able to use GOCR with the work Google is doing -- I'm not sure if
they're open sourcing all of it or not.
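
For anyone who wants to experiment anyway, here is a rough sketch of the
split / OCR / stitch workflow described above.  It assumes each article
has already been cut by hand into per-fragment image files listed in
reading order, and that a `gocr` binary is on the PATH and prints text
to stdout when given `-i <file>`.  The function names and file names are
made up for illustration, and the output would still need hand editing.

```python
import subprocess
from typing import Callable, Iterable


def gocr_text(image_path: str) -> str:
    """OCR one image by shelling out to gocr (assumed to be on PATH)."""
    result = subprocess.run(
        ["gocr", "-i", image_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_article(
    fragment_paths: Iterable[str],
    ocr: Callable[[str], str] = gocr_text,
) -> str:
    """OCR each fragment separately, then stitch the text in reading order."""
    pieces = []
    for path in fragment_paths:
        text = ocr(path).strip()
        if text:  # skip fragments where the OCR engine found nothing
            pieces.append(text)
    return "\n".join(pieces)
```

Since the OCR step is just a callable, a different engine could be
swapped in by passing another function as `ocr`, e.g.
`ocr_article(["col1.pnm", "col2.pnm"], ocr=my_engine)`.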

		--Ted