What is OCR?

by Ilya Evdokimov | Sep 24, 2019 | Accuracy

Optical mark recognition

OCR-Optical mark recognition (also called optical mark reading and OMR) is the process of capturing human-marked data from document forms such as surveys and tests. They are used to read questionnaires, multiple-choice examination paper in the form of lines or shaded areas.

Accuracy is highly dependent on document type, quality of the scan, and document makeup. How do we give OCR the best chance to succeed? There are many ways, what I would like to talk about now is quality assurance.

Quality assurance is usually the final step in any OCR process where human reviews uncertainties, and business rules based on the OCR result. An uncertainty is a character that the software flags that did not during recognition satisfy a threshold. This process is a balancing act between a desire to limit as much human time as possible and a need to see every possible error but not more.

Starting with a review of uncertainties. Here an operator will look at just those characters, words, sentences, that are uncertain. This is determined by the OCR product which will have some indicator of what they are. In full-page OCR often spell checking is used. In Data Capture usually, a review character-by-character of a field is done and you don’t see the rest of the results. Some organizations will set critical fields to be reviewed always no matter the accuracy. Others may decide that a field is useful but does not need to be 100%. Each package has its own variation of “verification mode”. It’s important to know their settings and the levels of uncertainty your documents are showing to plan your quality assurance.

After the characters and words have been checked in Data Capture there is an additional step in quality assurance, business rules. In this process the software will apply arbitrary rules the organization creates and check them against the fields, a good example might be “don’t enter anyone in the system whose birth year is earlier than 1984”. If such a document is found it is flagged for an operator to check. These rules can be endless, and packages today make it very easy to create custom rules. The goal would be to first deploy business rules you have already in place in the manual operation and augment it with rules to enhance accuracy based on the raw OCR results you are seeing.

In some more advanced integrations, the use of a database or body of knowledge is deployed as first-round quality assurance that is also still automated.

These two quality assurance steps combined should give any company a chance to achieve the accuracy they are seeking. Companies who fail to recognize or plan for this step are usually the ones that have the biggest challenges using OCR and Data Capture technology.