Don’t over-clean your scanned images

by Ilya Evdokimov | Dec 04, 2009 | imaging

There is almost always some way to modify a scanned image to improve its recognition results if it is not already perfect. But there are also ways to modify an image that destroy recognition results. Not all image clean-up is good for OCR, for several reasons.

There are two types of image clean-up. The first is clean-up for view-ability: the tricks that make images look prettier on the screen. The goal is what is called “pixel perfect”, where the image looks as if it were electronically generated. The second is clean-up for OCR or Data Capture: the tricks that help the image produce better recognition results. All image clean-up for OCR and Data Capture is good for view-ability, but not all image clean-up for view-ability is good for recognition. The reason is that recognition engines were built and trained at a time when many image clean-up technologies were not available, and because recognition technologies interpret pixels, it is possible to remove useful ones.

Here are some tips. When recognition is the primary purpose, stick to the types of clean-up that help it. Some products and scanners even offer what is called “dual stream”, where one scan produces two images that can follow separate paths. If you have this function, use the settings that are best for OCR on one image and the settings that are best for view-ability on the other. Good for OCR is:

1.) Despeckle ( unless dot-matrix font )
2.) Line Straightening
3.) Basic Thresholding
4.) Background removal
5.) Correction of Linear Distortion
6.) Dropout
7.) Line Removal ( sometimes )
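
To make two of the items above concrete, here is a minimal sketch of basic thresholding followed by despeckle on a tiny grayscale page. It is an illustration only, not how any particular scanner or OCR product implements these steps: the function names, the fixed cutoff of 128, and the "no neighbouring ink pixel" rule for a speck are all assumptions chosen for the example.

```python
from typing import List

Gray = List[List[int]]    # grayscale pixels: 0 = black, 255 = white
Binary = List[List[int]]  # binarized pixels: 1 = ink, 0 = background


def basic_threshold(image: Gray, cutoff: int = 128) -> Binary:
    """Basic (global) thresholding: one fixed cutoff for the whole page.

    The cutoff of 128 is an assumed value for this sketch.
    """
    return [[1 if px < cutoff else 0 for px in row] for row in image]


def despeckle(binary: Binary) -> Binary:
    """Drop ink pixels that have no ink pixel among their 8 neighbours.

    Treating "isolated pixel" as the definition of a speck is an
    assumption; real despeckle filters usually work on small blobs.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            has_neighbour = any(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


page = [
    [255, 255, 255, 255, 255],
    [255,  20,  30, 255, 255],   # part of a character stroke
    [255,  25, 255, 255, 255],
    [255, 255, 255, 255,  10],   # isolated speck
]
binary = basic_threshold(page)
cleaned = despeckle(binary)
```

In this sketch the three connected stroke pixels survive and the lone dark pixel in the corner is removed. It also hints at why the dot-matrix caveat matters: if a font is printed as separate dots, a filter like this may treat the dots themselves as specks.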

Bad for OCR is:

1.) Adaptive Thresholding: Often causes a condition called “Fuzzy Characters”, where “c”s turn into “e”s. For hand-print, it often removes portions of characters.

2.) Character Regeneration: Removes critical information that OCR and ICR processes depend on. If you use it in OCR ( Machine-Print ) you will notice more “high confidence blanks”: the characters are so perfect that they look like images to the OCR engine and are ignored. In ICR ( Hand-Print ) you will damage the hand stroke of the characters, which confuses the ICR algorithms, reduces training’s ability to understand the subject, and ultimately reduces accuracy.

3.) Line Removal: Bad line removal makes bad OCR. The line fragments it leaves behind really interfere with OCR and ICR processes.
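
The two lists above can be sketched as a simple filter for a dual-stream setup: the viewing stream keeps every clean-up step that was requested, while the OCR stream keeps only the steps listed as good for recognition. This is a hypothetical illustration; the step names, the `split_streams` function, and its flags are invented for the example and do not correspond to any real scanner or capture product settings.

```python
from typing import Dict, List

# Step names are hypothetical labels for the items in the lists above.
GOOD_FOR_OCR = {
    "despeckle",
    "line_straightening",
    "basic_thresholding",
    "background_removal",
    "linear_distortion_correction",
    "dropout",
}
# Listed as good only "sometimes", so it is opt-in for the OCR stream.
CONDITIONAL_FOR_OCR = {"line_removal"}


def split_streams(
    requested: List[str],
    dot_matrix: bool = False,
    allow_line_removal: bool = False,
) -> Dict[str, List[str]]:
    """Return clean-up steps for the OCR stream and the viewing stream."""
    ocr_steps = []
    for step in requested:
        if step == "despeckle" and dot_matrix:
            continue  # the list says: despeckle unless dot-matrix font
        if step in GOOD_FOR_OCR:
            ocr_steps.append(step)
        elif step in CONDITIONAL_FOR_OCR and allow_line_removal:
            ocr_steps.append(step)
        # Anything else (adaptive thresholding, character regeneration,
        # unknown steps) is kept out of the OCR stream.
    return {"ocr": ocr_steps, "viewing": list(requested)}


requested = [
    "despeckle",
    "adaptive_thresholding",
    "character_regeneration",
    "line_removal",
    "background_removal",
]
streams = split_streams(requested)
```

With the default flags, the OCR stream here ends up with only despeckle and background removal, while the viewing stream keeps all five steps. Making line removal opt-in reflects the "sometimes" in the list: it is assumed to be enabled only after checking that it does not leave fragments behind.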

When using image clean-up for OCR and Data Capture processes, use only the steps that improve recognition rates, not the ones that destroy them.

Chris Riley – Sr. Solutions Architect