Occasionally the need comes along to convert large-format documents such as maps and engineering drawings. Often the OCR requirement is clearly defined and limited to a small subset of fields, but when the goal is to convert the entire document and capture as much text as possible, there are many things you need to consider.
First, if you already have the ability to scan large-format drawings, or are already receiving images of them, congratulations, as this can be one of the biggest challenges. Scanning large-format documents requires either a large-format scanner or stitching together partial scans (less preferred). Because these documents use small fonts, it's important to scan at 300 to 400 DPI. For maps, because of the amount of graphics, the ideal is to drop out all colors or to produce a thresholded black-and-white scan, so that you are left with mostly just text in the image.
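To make the resolution and thresholding advice concrete, here is a minimal sketch in plain Python. The function names, the 36 by 24 inch sheet, the 6 point font, and the cutoff of 128 are all assumptions chosen for illustration; a real pipeline would use an imaging library and a cutoff tuned to the scan.

```python
def scan_size_px(width_in, height_in, dpi):
    """Pixel dimensions of a sheet scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def glyph_height_px(point_size, dpi):
    """Approximate height in pixels of a font size (1 point = 1/72 inch)."""
    return point_size * dpi / 72


def threshold(gray_rows, cutoff=128):
    """Turn a grayscale grid (0 = black, 255 = white) into pure black and white.

    Pixels darker than the cutoff become black (0); everything else,
    including light map tints, becomes white (255).
    """
    return [[0 if value < cutoff else 255 for value in row] for row in gray_rows]


# Hypothetical example: a 36 x 24 inch drawing with 6 pt labels.
width_px, height_px = scan_size_px(36, 24, 300)
label_height = glyph_height_px(6, 300)
```

The arithmetic shows why resolution matters: at 300 DPI a 6 point label is only about 25 pixels tall, so dropping the resolution much further leaves the engine very little to work with.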
The purpose of OCR for most of these documents is indexing and searchability, so the goal is to capture as much text as you can. For maps, a good scan should give you the majority of the text, except for names printed on a curve. Running line straightening on these might work, but it is more likely to hurt recognition of the rest of the map, so I would recommend avoiding it. Prior to OCR, set your OCR engine to disable auto-rotate, because many things on these documents can cause a mis-rotation, namely text printed in every direction.
Now to the secret: it has to do with rotation. Depending on the layout of the drawing or map, if you OCR the document at each 90-degree rotation, then once you have completed the full 360 degrees you will have the majority of the text. That's right, I'm suggesting that you OCR the document four times, hopefully in an automated fashion. This might leave you thinking that you will end up with a lot of garbage, and you are right. But you can simply run the final OCR result against a dictionary to remove the garbage text.
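The four-pass idea can be sketched roughly as follows. This is not tied to any particular OCR product: `rotate` and `ocr` are hypothetical callables standing in for whatever your engine or API provides, and the dictionary is just a set of lowercase words you supply.

```python
import string


def ocr_all_rotations(image, rotate, ocr):
    """Run OCR at 0, 90, 180 and 270 degrees and pool every token found.

    `rotate(image, angle)` and `ocr(image)` are placeholders (assumptions)
    for the calls your own imaging library and OCR engine expose.
    """
    words = []
    for angle in (0, 90, 180, 270):
        rotated = rotate(image, angle)
        words.extend(ocr(rotated).split())
    return words


def keep_dictionary_words(words, dictionary):
    """Drop garbage tokens, keeping each dictionary word once, in first-seen order."""
    kept = []
    seen = set()
    for word in words:
        # Normalize: trim surrounding punctuation and ignore case.
        token = word.strip(string.punctuation).lower()
        if token and token in dictionary and token not in seen:
            seen.add(token)
            kept.append(token)
    return kept
```

Text that is upside down or sideways in one pass comes out as noise in that pass and as real words in another, so the dictionary step is what turns four noisy results into one usable index. For maps, the dictionary would likely need to include place and street names, not just ordinary vocabulary.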
The end result is a map or drawing with the greatest possible amount of index-level text. I admit that I made it sound a little easier than it is, and most likely you will need an API to get the full job done, but the possibility exists and it has been proven successful.
Chris Riley – Sr. Solutions Architect
Ilya Evdokimov is a long-term practitioner and expert in leading Optical Character Recognition (OCR), Data Capture and Document Processing techniques, technologies and solutions. With over 15 years of experience spanning enterprise software implementations, mobile application development, cloud-based systems integration and desktop-level automation, Ilya Evdokimov uses thorough industry knowledge and experience to achieve high efficiency and workflow optimization in the most challenging paper-dependent and digital image capture environments.