When it comes to forms processing and data capture, working with documents that have hand-print vs. handwriting is a huge difference in accuracy and validity. Sometimes the difference between these two is not so clear. So how do you tell if your form is hand-print, or handwriting, or better yet both!
ICR ( Intelligent Character Recognition ) is the algorithm used in the place of OCR for characters generated by a human hand. The algorithm is more dynamic as a persons hand-print changes slightly by the minute. It’s possible to be very accurate when processing hand-print forms when the form is designed correctly. When doing this type of forms processing you will always have quality assurance steps, but you can get close to the accuracy of any OCR process. Very often forms that were not created with data capture or automatic extraction in mind will contain handwriting. The reason for this is that hand-print is usually guided by the form itself. Forms without hand-print cannot expect to be processed at a high accuracy. So what makes hand-print, hand-print?
Mono-spaced text: What this means that each character as it’s filled out is the same distance apart as all the other characters. In handwriting very often you will have characters that connect, in the extreme form this is cursive. When characters touch or are not spread out equally you get improper segmentation and get characters clumped together as one or split in half during recognition. Mono-spaced text is usually achieved using boxes on the form guiding the user to fill within the boxes.
Uniform Height and Width: Similar to mono-spaced text the text as it is filled in should have a more or less uniform height or width. This forces the completer to not introduce as many variable elements as they would in straight handwriting and increases accuracy. This is also accomplished using boxes on the form keeping users within boundaries.
Stable Base-Line: This aspect of hand-print is the lessor thought about but very important. Text must always be on the same horizontal base-line. What happens typically in handwriting is a user varies up and down on an invisible baseline. You may have noticed sometimes when you write that the end of any line is lower then the beginning. Baselines are important for OCR and ICR to get proper character segmentation and recognition of a few key characters such as “q” and “p” the “tail” characters.
Sans-serif: The last element is keeping characters sans-serif. The reason for this is the extra tails to characters can cause confusion between certain characters like “o” vs. “q” and “c” vs. “e”. The way to achieve this is less obvious, it is by putting a guide on the top of the form that shows a good character and a bad character.
ICR is a technology for Hand-print recognition and can be very accurate when having the proper guides. Today handwriting and cursive automation is not complete and usually only successful when augmented with other technologies such as data base look-up and CAR and LAR. Sometimes the difference between the two is unclear, but the above 4 elements provide a clear definition of hand-print. The best hand-print that can be found is by the highly training creators of engineering drawings whose print is so perfect it resembles very closely typographic text.
Chris Riley – Sr. Solutions Architect
Ilya Evdokimov is a long-term practitioner and expert in leading Optical Character Recognition (OCR), Data Capture and Document Processing techniques, technologies and solutions. With over 15 years of experience spanning enterprise software implementations, mobile applications development, cloud-based systems integration and desktop-level automation, Ilya Evdokimov uses through industry knowledge and experience to achieve high efficiency and workflow optimization in most challenging paper-dependent and digital image capture environments.