Get the “Blank” out of here

by Ilya Evdokimov | Jan 21, 2010 | Accuracy

One of the challenges of document scanning is pesky blank pages. They are usually more of an annoyance and a waste of space than a real problem. Blank pages in PDF files cause needless scrolling, and in text documents they make you believe something has been missed. With duplex scanning, you can be assured that blank pages will be there unless you take steps to remove them. There are several ways to zap blank pages from a batch scan: you can remove them prior to scanning, during scanning, or after scanning, at or following recognition. Removing blank pages prior to scanning is only possible if the sheets are blank on both sides, or if you selectively switch between simplex and duplex depending on the document. This is cumbersome and takes up a lot of time.

Most document scanners today include a blank page removal tool as part of their driver. These tools vary slightly. Some have specific algorithms that detect blank pages not only by the amount of white on the page but also, possibly, by how a page relates to other pages in the batch. Sometimes this is problematic when the back sides of documents carry very little text. A second approach is to measure the resulting image file size: a file under a certain number of kilobytes is likely a blank page. This has the same problem of removing pages with very little text, which often occur on the back sides of documents. The final and most accurate way is to measure the amount of black or color pixels on the page and set a small threshold, such as 1% or 2%, below which the page is considered blank. This approach requires you to know your documents beforehand, and it may be problematic with greyscale scans or contrast settings that make blank pages slightly gray. The other option is to have imaging or OCR software remove the pages for you.
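To make the pixel-threshold idea concrete, here is a minimal sketch in Python. It is not how any particular scanner driver works. It assumes the page has already been loaded as rows of grayscale values (0 is black, 255 is white), and the names `ink_ratio`, `is_blank`, `dark_cutoff` and `max_ink` are made up for illustration.

```python
def ink_ratio(pixels, dark_cutoff=128):
    """Return the fraction of pixels darker than dark_cutoff.

    pixels is a list of rows of grayscale values (0 = black, 255 = white).
    """
    total = 0
    dark = 0
    for row in pixels:
        for value in row:
            total += 1
            if value < dark_cutoff:
                dark += 1
    return dark / total if total else 0.0


def is_blank(pixels, max_ink=0.01, dark_cutoff=128):
    """Treat the page as blank when at most max_ink of it is dark."""
    return ink_ratio(pixels, dark_cutoff) <= max_ink


# A 100x100 "page" with a slightly gray background, as a greyscale scan might produce.
blank_page = [[230] * 100 for _ in range(100)]

# The same page with a short block of black "text": 5 rows x 50 columns = 250 pixels.
sparse_page = [row[:] for row in blank_page]
for y in range(10, 15):
    for x in range(10, 60):
        sparse_page[y][x] = 0

print(is_blank(blank_page))                  # no dark pixels at the default cutoff
print(ink_ratio(sparse_page))                # 250 of 10000 pixels are dark
print(is_blank(sparse_page, max_ink=0.03))   # a looser threshold drops the sparse page
print(ink_ratio(blank_page, dark_cutoff=240))  # a cutoff that is too high counts the gray background as ink
```

The last two lines show both pitfalls mentioned above: a threshold that is too generous throws away a page with very little text, and a darkness cutoff that does not match your contrast settings makes a slightly gray blank page look full.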

Some OCR applications, though not most, can also detect blank pages. They use a combination of pixel detection and the presence of text. This might slow down your OCR process, but it is a useful tool if it is available. More likely, you can purchase a full imaging application with very robust blank page removal tools, akin to what you would find in a scan driver but usually with more options.
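As a rough illustration of combining pixel detection with the presence of text, the sketch below calls a page blank only when both signals agree. It assumes you already have the fraction of dark pixels on the page and whatever text your OCR engine returned for it. The function name `looks_blank` and the thresholds `max_ink` and `min_chars` are hypothetical and not taken from any specific OCR product.

```python
def looks_blank(ink_fraction, recognized_text, max_ink=0.02, min_chars=3):
    """Call a page blank only if it has little ink AND almost no recognized text.

    ink_fraction    -- share of dark pixels on the page, from 0.0 to 1.0
    recognized_text -- text returned by the OCR engine for this page
    """
    # Count letters and digits only, so stray punctuation from scanner
    # specks does not count as real content.
    char_count = sum(1 for ch in recognized_text if ch.isalnum())
    return ink_fraction <= max_ink and char_count < min_chars


print(looks_blank(0.005, ""))             # little ink, no text: blank
print(looks_blank(0.005, "Page 2 of 2"))  # little ink, but real text: keep it
print(looks_blank(0.10, ""))              # no text, but plenty of ink (a photo, perhaps): keep it
```

Requiring both conditions is a deliberately cautious choice: it protects the sparse back sides of documents that a pixel threshold alone would remove.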

Organizations such as service bureaus often combine methods to ensure that no blanks make it through. Blank page detection tools are very accurate and very useful, and you can start using them today.

Chris Riley – Sr. Solutions Architect