Color Scanning and Recognition – “No text left behind”

by Ilya Evdokimov | Nov 24, 2009 | Accuracy

OCR technology has come a long way since its creation. On clean, letter-type documents scanned at 300 DPI, the technology has arrived and there is not much room for improvement. But what about the rest of the documents out there? How is OCR improving on them? When you compare that perfect letter to a not-so-perfect article or newspaper page, the big difference is text placement and layout. One of the keys to better OCR is improving the ability to tell what is graphics and what is text. Within the text, the engine has to identify columns, paragraphs, sentences, words, and finally characters. Only then can the OCR take a whack at interpreting the text. This process is called Document Analysis. Sometimes OCR accuracy is lower not because the text was misread, but because the OCR software tried to read things that are not text, or because some of the text in the document was never found and so was simply ignored.
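
To make the "lines, then words, then characters" idea concrete, here is a minimal sketch of one classic way to do it, projection profiles: count the ink in each row to find text lines, then count the ink in each column of a line to find words or characters. This is only an illustration and not how any particular OCR engine works. The names `runs`, `find_lines`, and `find_spans_in_line`, the `min_gap` parameter, and the tiny 0/1 bitmap are all assumptions made up for the example.

```python
from typing import List, Tuple

Bitmap = List[List[int]]   # hypothetical page format: 1 = ink, 0 = background
Span = Tuple[int, int]     # half-open interval [start, end)


def runs(profile: List[int], min_gap: int = 1) -> List[Span]:
    """Return spans where the profile is non-zero.

    Spans separated by fewer than `min_gap` empty positions are merged,
    so a larger `min_gap` groups characters into words.
    """
    spans: List[Span] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ended just before i
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Span] = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)  # gap too small: same unit
        else:
            merged.append((s, e))
    return merged


def find_lines(page: Bitmap) -> List[Span]:
    """Text lines are the row ranges that contain any ink."""
    row_profile = [sum(row) for row in page]
    return runs(row_profile)


def find_spans_in_line(page: Bitmap, line: Span, min_gap: int = 1) -> List[Span]:
    """Column ranges with ink inside one text line.

    min_gap=1 separates on every blank column (roughly characters);
    a larger min_gap only separates on wider gaps (roughly words).
    """
    top, bottom = line
    width = len(page[0])
    col_profile = [sum(page[y][x] for y in range(top, bottom)) for x in range(width)]
    return runs(col_profile, min_gap)


# A toy "page": two text lines, the first holding three glyphs.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
]

lines = find_lines(page)
chars = find_spans_in_line(page, lines[0], min_gap=1)
words = find_spans_in_line(page, lines[0], min_gap=2)
```

On the toy page, the one-column gap is treated as a break between characters, while only the two-column gap counts as a break between words. Real pages need much more than this (skew, multiple columns, noise), which is exactly why Document Analysis is hard.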

In the last few years, text identification (Document Analysis) has been one of the areas of greatest improvement, and that looks set to continue. Many new products leverage color as one more tool for not leaving any text behind. With color, locating the different parts of a document is easier and more accurate, and so the overall OCR is more accurate. The most obvious benefit of color is the ability to locate graphics. Sometimes index-level OCR requires that even text within graphics be read to enhance the searchability of a document. With color detection, modern engines are advancing toward locating the text in pictures and ignoring the rest. Very stylized documents pose the greatest challenge to Document Analysis, and color is one of the best tools for attacking them. Expect to see this trend continue, with more focus on Document Analysis and the pursuit of no text left behind.
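
As a rough illustration of why color helps locate graphics, here is a small sketch of one possible heuristic: a block of plain text usually contains only a couple of colors (ink and paper), while a photo or illustration contains many. This is an assumed, simplified rule for the sake of example, not the method any specific product uses. The function names, the `levels` quantization, and the `max_text_bins` threshold are all hypothetical choices.

```python
from typing import Iterable, Tuple

RGB = Tuple[int, int, int]


def distinct_color_bins(pixels: Iterable[RGB], levels: int = 4) -> int:
    """Count distinct colors after coarse quantization.

    Each channel (0-255) is reduced to `levels` buckets so that small
    shade variations from scanning do not count as new colors.
    """
    step = 256 // levels
    return len({(r // step, g // step, b // step) for r, g, b in pixels})


def looks_like_graphic(pixels: Iterable[RGB], levels: int = 4,
                       max_text_bins: int = 3) -> bool:
    """Assumed rule of thumb: more than a few color bins suggests a graphic."""
    return distinct_color_bins(pixels, levels) > max_text_bins


# Toy regions: black text on white paper vs. a patch with many colors.
text_region = [(0, 0, 0)] * 10 + [(255, 255, 255)] * 90
photo_region = [(r, g, 128) for r in range(0, 256, 64) for g in range(0, 256, 64)]

text_is_graphic = looks_like_graphic(text_region)
photo_is_graphic = looks_like_graphic(photo_region)
```

A bitonal (black-and-white) scan throws this signal away before analysis starts, which is the case for scanning in color. Colored text on a colored background would fool a rule this simple, so treat it as a starting point only.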

Chris Riley – Sr. Solutions Architect