Someone asked:
“I am working on a research project that deals with American military casualties during WWII. Specifically, I am attempting to construct a count of casualties for each service at the county level. There are two sources of data here, each presenting their own challenges. 1. Army and Air Force data. 2. Navy and Marine Corps data.”
Full question is here: http://datascience.stackexchange.com/
Both data sets call for OCR followed by some post-processing, but with a more specialized program than a generic low-quality or open-source OCR engine. Essentially, the harder the problem, the more capable the tools needed to solve it.
There will be two major stages in this task: generating the data (image to text, i.e. OCR), and processing the data (doing the actual count). Look at them separately in order to select the best method for each stage.
The main OCR challenges in these images are:
- The images have low resolution. For example, image #1 has a resolution of about 72 dpi. The suggested scanning resolution for text of this quality is 300 to 400 dpi, but re-scanning or controlling the scan resolution is clearly not an option now. One option, therefore, is to clean up and enlarge the images using image pre-processing tools. This is what the original snippet from image #1 looks like after adaptive binarization, zoomed to 300%. Each character clearly has too few pixels, so characters can easily be misread.
- The GIF format used in #1 is not supported by many OCR applications. Images need to be batch-converted to a different format, such as PNG or TIFF.
- Backgrounds and bleed-through (shadow from the text on the other side of the paper) are visible in these scans. Good binarization is needed to remove the background and bleed-through without removing vital parts of the actual characters.
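To make the enlargement and binarization steps above more concrete, here is a minimal pure-Python sketch. It is not the pre-processing used for this project: the function names, the window size, and the offset are illustrative assumptions, and the image is assumed to be already decoded into rows of 0-255 grayscale values. In practice an imaging library would do this far faster.

```python
def upscale_nearest(gray, factor):
    """Enlarge a grayscale image (list of rows of 0-255 ints) by an integer factor."""
    out = []
    for row in gray:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def adaptive_binarize(gray, window=15, offset=10):
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset.

    window and offset are hypothetical tuning values, not settings from the article.
    """
    h, w = len(gray), len(gray[0])
    # Integral image: integ[y][x] holds the sum of all pixels above and left of (x, y).
    integ = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum

    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = integ[y1][x1] - integ[y0][x1] - integ[y1][x0] + integ[y0][x0]
            mean = total / ((y1 - y0) * (x1 - x0))
            out[y][x] = 1 if gray[y][x] < mean - offset else 0
    return out
```

Because the threshold follows the local mean, a faint bleed-through shadow that is only slightly darker than the surrounding paper falls on the background side, while real ink stays. The offset controls how aggressive that cut is, and setting it too high starts to erase thin strokes.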
After applying specific pre-processing for the items listed above and then using a high-quality OCR system, such as the www.ocr-it.com API, the best possible result can be achieved. The result is far from perfect, but it is as accurate as a modern OCR engine can get on these images.
Luckily for this project, the data only needs to be counted, so the second stage has everything it needs for reliable post-processing. Unlike more basic OCR engines, the OCR provided by the www.ocr-it.com API returns text with its layout, preserving line breaks and the overall format structure.
A simple algorithm can then count the number of lines, producing the count the research needs.
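As a rough illustration of such a line count, the sketch below tallies entries per county from OCR text that kept its line breaks. It assumes that county names appear as indented lines and that every non-indented, non-blank line under them is one entry. The function name and that layout convention are assumptions for the example, not something the OCR service guarantees.

```python
from collections import Counter


def count_by_county(ocr_text):
    """Tally entry lines under each county heading in layout-preserving OCR text."""
    counts = Counter()
    county = None
    for line in ocr_text.splitlines():
        if not line.strip():
            continue  # blank line
        if line[0].isspace():
            county = line.strip()  # indented line, assumed to be a county heading
            continue
        if county is not None:
            counts[county] += 1  # flush-left line, assumed to be one casualty entry
    return counts
```

Since only the number of lines matters, a misread character inside a name does not change the result, which is why imperfect OCR can still give a reliable count.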
The above describes a two-stage approach: getting the best possible OCR result, and then using an applicable method to process the data for the required task.
But wait, there is more…
A second method uses an even more specialized OCR application called FlexiCapture with FlexiLayout technology. This powerful and intelligent data capture technology has built-in high-accuracy OCR, and it has a powerful rules and data analytics engine to perform very specialized user-defined chains of actions and tasks.
The implementation of this method using FlexiCapture with FlexiLayout takes the following logical steps.
First, full page OCR is performed and all objects are extracted, including characters, noise, black horizontal and vertical lines, white gaps, and objects (which could be pictures, logos, handwriting, etc.). This produces objects upon which we can apply our search criteria.
For this task, the following constraints have been applied to the post-OCR data analysis and search criteria:
- separate the image into three vertical columns and run the following logic per column;
- use each line start as an individual count;
- skip header, footer, and indented lines (county names);
- assume each name has at least three characters;
- find every name recursively, from top to bottom, in every column;
- exclude previously found lines.
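For readers without FlexiCapture, the same constraints can be approximated in a few lines of Python. This is a sketch of the logic only, not FlexiCapture's API: it assumes the OCR step already produced one `(x_left, y_top, text)` tuple per text line, and the margin and indent thresholds are made-up values that would need tuning on real pages.

```python
def count_names_in_columns(lines, page_width, page_height,
                           n_columns=3, margin=0.05, indent=15, min_chars=3):
    """Count name lines per column.

    lines is a list of (x_left, y_top, text) tuples, one per OCR text line.
    """
    col_width = page_width / n_columns
    top, bottom = page_height * margin, page_height * (1 - margin)

    # Assign each line to a column, dropping the header and footer bands.
    columns = [[] for _ in range(n_columns)]
    for x, y, text in lines:
        if not top <= y <= bottom:
            continue
        index = min(int(x // col_width), n_columns - 1)
        columns[index].append((x, y, text))

    counts = []
    for column in columns:
        found = 0
        if column:
            left_edge = min(x for x, _, _ in column)
            # Walk the column from top to bottom.
            for x, _, text in sorted(column, key=lambda item: item[1]):
                if x - left_edge > indent:
                    continue  # indented line, treated as a county name
                if len(text.strip()) < min_chars:
                    continue  # too short to be a name, likely noise
                found += 1
        counts.append(found)
    return counts
```

Each input tuple is visited exactly once, so no line can be counted twice, which covers the "exclude previously found lines" constraint in this simplified form.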
While the above logic sounds complex to set up, and there are a few other assumptions that had to be specified, the actual setup takes just a few minutes and requires minimal work in the user interface (UI) environment. No coding or programming is necessary. The following search elements and criteria have been created.
RepeatingGroup consisting of a CharacterString search object.
This setup produces the following search result for the first column of data.
As the last step, FlexiCapture is instructed to return the total number of found elements that fit our search criteria, effectively producing the necessary data for the research task.
There are other logic alternatives that can be set up in FlexiCapture, such as finding the number of white spaces between lines, or searching for the fixed-length fixed-placement 3-letter combinations at the end of every column.
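The three-letter alternative can also be tried outside FlexiCapture on plain OCR text. The sketch below assumes that the "3-letter combinations" are uppercase codes closing each entry line. Both the pattern and the function name are illustrative guesses rather than part of the original setup.

```python
import re

# A standalone run of exactly three capital letters at the end of a line.
TRAILING_CODE = re.compile(r"\b[A-Z]{3}\s*$")


def count_by_trailing_code(ocr_text):
    """Count lines that end in a three-letter uppercase code."""
    return sum(1 for line in ocr_text.splitlines() if TRAILING_CODE.search(line))
```

One caveat: a heading that happens to be exactly three capital letters would also match, so this check works best combined with the indentation rule rather than on its own.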
In conclusion, there are several options (which is always nice) for achieving this task with relative ease and high quality, but success depends on the quality of the tools used and on knowing how to use them.