What you OCR is what you get

by Ilya Evdokimov | Mar 02, 2010 | OCR

Often the purpose of doing Optical Character Recognition ( OCR ) for individuals and companies is to get a digital version of a document where the individual intends to edit and or re-purpose. This is not the most common use of the technology but a use that requires specific attention.

In order to convert a document so that it is printable later on, it’s important to not only get the text from the document but also the format of the text. This includes layout as well as things such as graphics, and font colors. To do this, the OCR product must be able to recognize colors (requires color scanning), recognize font styles, and very importantly, recognize document structure.

Engines that support advanced document analysis have this. Document analysis ( DA ) is the process that happens before any text is read on a page. Document analysis makes sense of a document in order to improve recognition as well as get the formatting required for a formatted export. First, document analysis finds document structured, ie. columns, tables, text, paragraphs lines. Once this is done, it identifies colors in text and graphics. After document analysis has done it’s job, the recognition can begin. During recognition, the style of fonts is detected: bold, italic, underlined. All of this is put together with a result formatted as close as possible to the input document.

For those individuals that are concerned about the re-purposing of their documents, a straight text OCR engine will not work. Basic OCR engines get the text on the document in digital form and nothing more. For these individuals, it’s important to find a solution that has good documenting analysis.

Chris Riley – Sr. Solutions Architect