OCR Page numbers and detect double feed

Submitted by brad on Fri, 2008-05-09 01:14

I'm scanning my documents on an ADF document scanner now, and it's largely pretty impressive, but I'm surprised at some things the system won't do.

Double page feeding is the bane of document scanning. To prevent it, many scanners offer methods of double feed detection, including ultrasonic detection of double thickness and detection when one page is suddenly longer than all the others (because it's really two sheets).
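The length check is simple enough to sketch. This is only an illustration, not how any particular scanner does it: the function name, the millimetre units and the 15% tolerance are all my own assumptions.

```python
from statistics import median

def flag_long_pages(lengths_mm, tolerance=1.15):
    """Return the scan-order indices of pages that look too long.

    lengths_mm: measured length of each scanned sheet, in scan order.
    tolerance:  assumed threshold; a sheet longer than the batch median
                times this factor is treated as a suspected double feed.
    """
    if not lengths_mm:
        return []
    typical = median(lengths_mm)
    return [i for i, length in enumerate(lengths_mm)
            if length > typical * tolerance]

# Four letter-size sheets and one that came through overlapped with another.
suspects = flag_long_pages([279, 280, 279, 410, 280])
```

Comparing against the median rather than the previous page keeps one bad feed from skewing the baseline. It would need rethinking for batches that legitimately mix page sizes.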

There are a number of other tricks they could do, I think. A paper feeder that used air suction or gecko-foot van der Waals force pluckers on both sides of a page, to try to pull the two sides in different directions, could help not just detect such feeds but eliminate them.

As it stands, the most the double feed detectors do is signal an exception to stop the scan, which means re-feeding work and a need to stand by.

Many documents have page numbers, though. We're going to OCR them anyway, and the OCR engine is pretty good at detecting page numbers (mostly out of a desire to remove them). So it seems to me a good approach would be to look for gaps in the page numbers, especially combined with the other signs of a double feed. Then don't stop the scan, just keep going, and report to the operator which pages need to be scanned again. Those would be rescanned, their numbers extracted, and they would be inserted in the right place in the final document.
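Here is a minimal sketch of that idea, assuming the OCR step has already given us one integer page number per scanned page, in scan order. The function names and the (number, image) tuples are hypothetical, not anything a real OCR package exposes.

```python
def find_missing_pages(page_numbers):
    """Return the page numbers skipped in a run of scanned page numbers.

    page_numbers: integers in scan order, assumed to be increasing.
    """
    missing = []
    for previous, current in zip(page_numbers, page_numbers[1:]):
        if current > previous + 1:
            # Everything strictly between the two numbers was never scanned.
            missing.extend(range(previous + 1, current))
    return missing

def merge_rescans(pages, rescans):
    """Put rescanned pages into their proper place in the document.

    pages, rescans: lists of (page_number, image) tuples.
    """
    return sorted(pages + rescans, key=lambda page: page[0])

# Report to the operator instead of stopping the scan.
to_rescan = find_missing_pages([1, 2, 3, 5, 6, 9])
```

With the example input the operator would be asked to rescan pages 4, 7 and 8, and `merge_rescans` would then slot them in by number.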

Of course, it's not perfect. Sometimes page numbers are not put on blank pages, and some documents number only within chapters. So you might not catch everything, but you could catch a lot of stuff. Operators could quickly discern the page numbering scheme (though I think the OCR could do this too) to guide the effort.
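A gap finder can allow for both of those cases, at least roughly. In this sketch (again my own assumption about how it might work, with hypothetical names) a page with no detected number is `None` and counts toward filling a gap, and a number that drops back is treated as a new chapter rather than an error.

```python
def find_gaps(page_numbers):
    """Return (scan_index, missing_count) for each suspected skip.

    page_numbers: OCR'd numbers in scan order, None where no number was found.
    """
    gaps = []
    previous = None   # last page number actually seen
    unnumbered = 0    # unnumbered pages scanned since then
    for index, number in enumerate(page_numbers):
        if number is None:
            unnumbered += 1
            continue
        if previous is not None and number > previous:
            # Unnumbered pages are assumed to occupy the next numbers.
            expected = previous + unnumbered + 1
            if number > expected:
                gaps.append((index, number - expected))
        # If number <= previous, assume numbering restarted (new chapter).
        previous = number
        unnumbered = 0
    return gaps

# One unnumbered (blank) page, a chapter restart, then a real skip.
gaps = find_gaps([1, 2, None, 4, 5, 1, 2, 4])
```

This will still miss a skip that happens right at a chapter boundary, which is the "might not catch everything" part.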

I'm seeking a maximum convenience workflow. I think the best plan for that is to have several scanners going and to run the OCR after the fact, in the background. That way there's always something for the operator to do -- fixing bad feeds, loading new documents, naming them -- for maximum throughput. I would also hope the OCR software could do better at naming the documents for you, or at least suggesting names. Perhaps it can; the manual for OmniPage is pretty sparse.

While some higher-end scanners do figure out the size of the page (at least the length), I am not sure why this isn't a trivial feature on all ADF scanners. My $100 Strobe sheetfed scanner does it. That my $6,000 (retail) FI-5650 needs extra software seems odd to me.
