Already digital but still OCRed

by Ilya Evdokimov | Mar 04, 2010 | Uncategorized

we use OCR to read the already digital files as images

I’ve faced unique projects in the last four years and in a few, the best approach even seemed to contradict my better logic. The projects I’m talking about are ones where the data we were working with was already in a digital format, namely a PDF file that was created digitally. What this meant was that all the text in the PDF was available and 100% accurate. So why then, to accomplish the project’s goals, did we use OCR to read the already digital files as images?

I had intended for all these projects to do a logical parsing of the already digital content so I can get what I want. The problem is that even though the internal structure of the PDF has a logical standard, it’s not used logically 90% of the time by most PDF generating applications. PDF has in it a tolerance for mistakes that allows organizations to deviate quite drastically from the standard. What this means is that not only is the content in each PDF unique per company that generates it, it’s unique per number of applications able to create them. Variations on-top of variations makes logical parsing very difficult. This becomes most obvious when the documents contain tables. Because of this the only way to text parse the PDF properly would be to flatten the internal logic so that they consist of nothing but text, but by doing so you lose some of the information pointing to where tables are and their structure.

You may have guessed by now that all my projects were to parse tables from PDF. Not just any table but specific tables in PDFs where each was a unique format. As I said before, my preference would have been to use the 100% accurate data already in the PDF. In the end what I ended up doing was OCRing the PDFs because they were what is called “pixel perfect” so the accuracy was very high. Now that I was using OCR, I was able to first recognize an entire document and remove everything that was not a table which was determined by my OCR document analysis. Then I was able to use keywords to find the specific table that I wanted. The end result took me about 3 weeks of work for each project, and the result was higher accuracy in table finding, and only slightly less accurate in the text values than a table parsing.

While it seemed most logical to do the parsing, in the end I saved over 5 man-months of work by using OCR.

Chris Riley – Sr. Solutions Architect