Check-mark accuracy, all or none

by Ilya Evdokimov | Mar 01, 2010 | Uncategorized

Check-mark processing (OMR) is one of the most accurate recognition technologies. Companies that use OMR properly are able to process documents quickly and accurately. But for the same reason OMR is accurate, it can also be very inaccurate when not used properly.

For the most part, OMR is an all-or-nothing technology. Unlike the varying degrees of accuracy and uncertainty in OCR, with OMR a field is either checked or not. Where accuracy and uncertainty come into play is when you deal with collections of check-marks, where the technology compares the results of all of them to see which ones are most likely checked. The three areas where organizations make mistakes when using OMR are: the wrong OMR field type, poor thresholds, and bad rules.

Many think of OMR fields as the traditional bubble on school tests. But there are several types of OMR fields: rectangle, round, automatic, and white field. Unlike in text recognition, selecting the wrong field type in OMR produces completely incorrect results most of the time.

Rectangle and round are the traditional fields that come to mind when thinking of check-marks. The technology used to process them also includes a way to tell if a field has been corrected (slashed out and the answer changed). For these fields, the borders of the field are detected, and when a high enough number of black pixels is found within the border, the field is considered checked. The only time this will not be the case is when a field has been detected as having a correction.
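The pixel-counting idea behind rectangle and round fields can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the function names, the one-pixel border, and the 20% threshold are assumptions, and correction detection is left out entirely. A field is represented as a grid where 1 is a black pixel and 0 is a white one.

```python
def interior(field, border=1):
    """Return the pixels inside the field, skipping `border` pixels on each side."""
    return [row[border:len(row) - border] for row in field[border:len(field) - border]]


def fill_ratio(field, border=1):
    """Fraction of interior pixels that are black (1 = black, 0 = white)."""
    pixels = [p for row in interior(field, border) for p in row]
    return sum(pixels) / len(pixels) if pixels else 0.0


def is_checked(field, threshold=0.2, border=1):
    """Hypothetical rule: checked when enough interior pixels are black."""
    return fill_ratio(field, border) >= threshold


# A 5x5 rectangle field: the outer ring is the printed border.
EMPTY_BOX = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]

# The same field with an "X" drawn inside the border.
MARKED_BOX = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
]

print(is_checked(EMPTY_BOX))   # False: no black pixels inside the border
print(is_checked(MARKED_BOX))  # True: 5 of 9 interior pixels are black
```

Because the border is located first and excluded, only what the person drew contributes to the count.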

Automatic field types are for forms that have non-traditional borders around their fields, or that have some sort of text already printed in the field. For example, if you scan a Scantron form as a black and white image without dropout, each field will be a circle with a letter or number printed in the middle. In this case you would have to use the automatic field type. The software compares an EMPTY form to the form being processed. If, for example, a field has the letter “A” printed in the middle, the software counts how many pixels the “A” consists of and uses that as a baseline. For a field to be checked, it has to contain some number of black pixels OVER the baseline. If in this case you used a rectangle or round check-mark field, all fields would be considered checked because no baseline was established. Last are white fields.
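The baseline comparison used by automatic fields can be sketched as follows. The names, the tiny 5x5 grids, and the `min_extra_pixels` margin are all made up for illustration; real packages will measure and compare differently. The sketch also shows why a plain fill-ratio test misfires on a field with printed text.

```python
def count_black(field):
    """Count black pixels in a field (1 = black, 0 = white)."""
    return sum(sum(row) for row in field)


def is_checked_automatic(field, empty_field, min_extra_pixels=3):
    """Hypothetical rule: checked when the field holds enough black pixels
    OVER the baseline taken from the same field on an empty form."""
    baseline = count_black(empty_field)
    return count_black(field) - baseline >= min_extra_pixels


def naive_is_checked(field, threshold=0.2):
    """A plain fill-ratio test with no baseline, for comparison."""
    total = sum(len(row) for row in field)
    return count_black(field) / total >= threshold


# Field from the EMPTY form: a crude printed letter, 8 black pixels.
EMPTY_WITH_LETTER = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]

# Same field on a processed form, pencilled in: 18 black pixels.
FILLED_WITH_LETTER = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0],
]

print(is_checked_automatic(EMPTY_WITH_LETTER, EMPTY_WITH_LETTER))   # False
print(is_checked_automatic(FILLED_WITH_LETTER, EMPTY_WITH_LETTER))  # True
print(naive_is_checked(EMPTY_WITH_LETTER))  # True: the printed letter alone trips it
```

The last line is the failure the article describes: without a baseline, the printed letter itself is enough to make an untouched field look checked.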

White fields are check-mark fields that have no border. They are most often found on forms scanned with dropout, or are sometimes used for unusual cases such as detecting signatures. This useful type of check-mark simply expects no border and no printed text in the field area. If there is even a small number of black pixels in the field area, it’s considered checked. If you use a white field on a rectangle OMR field, it will always be considered checked because of the borders. The biggest challenge for white fields is that the size of the field directly impacts its accuracy, so proper sizes must be chosen. All check-marks have thresholds assigned to them.
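To see why field size matters so much for white fields, consider this small sketch. It assumes a white field is judged by the share of black pixels over its whole area against a low threshold; the 5% value and all names are illustrative assumptions, not taken from any product.

```python
def white_field_checked(zone, threshold=0.05):
    """Hypothetical white-field rule: the whole zone is expected to be blank,
    so a small share of black pixels already counts as checked."""
    total = sum(len(row) for row in zone)
    black = sum(sum(row) for row in zone)
    return total > 0 and black / total >= threshold


def blank_zone(width, height):
    """Build an all-white zone of the given size."""
    return [[0] * width for _ in range(height)]


def with_mark(zone, pixels):
    """Return a copy of the zone with the given (row, column) pixels set to black."""
    marked = [row[:] for row in zone]
    for r, c in pixels:
        marked[r][c] = 1
    return marked


# The same 4-pixel pen stroke, captured by two differently sized zones.
STROKE = [(0, 0), (1, 1), (2, 2), (3, 3)]
tight_zone = with_mark(blank_zone(4, 4), STROKE)    # 4 of 16 pixels black
loose_zone = with_mark(blank_zone(10, 10), STROKE)  # 4 of 100 pixels black

# An empty rectangle field read as a white field: the border alone is 16 of 25 pixels.
bordered_box = [[1] * 5] + [[1, 0, 0, 0, 1] for _ in range(3)] + [[1] * 5]

print(white_field_checked(tight_zone))    # True
print(white_field_checked(loose_zone))    # False: the stroke is diluted by the larger area
print(white_field_checked(bordered_box))  # True, even though nothing was marked
```

The identical stroke is detected in the tightly drawn zone and missed in the oversized one, and the bordered box is reported as checked on an empty form.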

A threshold is the setting that determines the percentage of black pixels required before a field is considered checked. Organizations rarely need to change the default thresholds, yet changing them is one of the biggest mistakes made. Most OMR processing packages have default thresholds for all field types. The vendors have done the research to know the optimum threshold for both accuracy and avoiding false positives. When companies pick the wrong threshold, fields get considered checked when they are not, and the other way around. The problem is that most of these errors are never reviewed, because with custom thresholds they never get flagged. That creates a false positive, the worst possible outcome of any exception.
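A quick way to picture the effect of a badly chosen threshold is to run the same measurements through several settings. The fill ratios and threshold values below are invented for illustration and do not correspond to any vendor's defaults.

```python
# Hypothetical measured fill ratios for three fields on one form.
FILL_RATIOS = {"smudge": 0.08, "light_mark": 0.30, "solid_mark": 0.70}


def classify(fill_ratios, threshold):
    """Mark each field as checked when its fill ratio meets the threshold."""
    return {name: ratio >= threshold for name, ratio in fill_ratios.items()}


too_low = classify(FILL_RATIOS, 0.05)   # the smudge is reported as checked
moderate = classify(FILL_RATIOS, 0.20)  # the smudge is ignored, both marks are kept
too_high = classify(FILL_RATIOS, 0.50)  # the light mark is silently dropped

for label, result in [("0.05", too_low), ("0.20", moderate), ("0.50", too_high)]:
    print(label, result)
```

Nothing in the two bad settings raises an exception or a flag. The output simply changes, which is why these errors tend to go unreviewed.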

As with all data capture and forms processing tools, there is usually a step of validation and rules. For whatever reason, organizations tend to over-think the rules associated with check-marks. The most common rule is that for any given collection of check-marks associated with a single question, only one, or a specific combination, can be checked. So for example, for a multiple choice question that asks for one answer, if the software sees two checked it will flag both fields. These rules are very useful, but when improperly implemented they result in either too much verification of fields, which is OK but a time waster, or false positives like those caused by bad thresholds. Sometimes the rules are applied during recognition and thus affect recognition results. For example, a question that has no answer, but where one is expected, has an answer forced onto it. It’s easy to blame the software, but most of the time it’s just a bad rule.
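A single-answer rule of the kind described above might look like the sketch below. The function name and return shape are hypothetical. Flagging an unanswered question for review, instead of forcing an answer, is an assumption made here to avoid the bad-rule case just mentioned.

```python
def validate_single_choice(marks):
    """Apply a one-answer-only rule to a question's check-marks.

    `marks` maps each option name to True (checked) or False (not checked).
    Returns the accepted answer, if any, and the fields flagged for review.
    """
    checked = [name for name, is_on in marks.items() if is_on]
    if len(checked) == 1:
        return {"answer": checked[0], "flagged": []}
    if not checked:
        # Assumption: send the whole question to review; never invent an answer.
        return {"answer": None, "flagged": list(marks)}
    # More than one option checked: flag every checked field.
    return {"answer": None, "flagged": checked}


print(validate_single_choice({"A": False, "B": True, "C": False}))
# {'answer': 'B', 'flagged': []}
print(validate_single_choice({"A": True, "B": True, "C": False}))
# {'answer': None, 'flagged': ['A', 'B']}
print(validate_single_choice({"A": False, "B": False, "C": False}))
# {'answer': None, 'flagged': ['A', 'B', 'C']}
```

Running the rule after recognition, on the recognized values, keeps it from altering the recognition results themselves.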

OMR is a great tool when used right because it’s extremely fast and accurate. When it’s used wrong, it’s still fast, just extremely inaccurate.

Chris Riley – Sr. Solutions Architect