Text Parsing vs. Dynamic Data Capture
I answered this question on StackOverflow, and it was too important not to duplicate here.
I am extracting text from OCRed TIFF files using a library and dumping it into a database. The text I am extracting comes from forms that have fields like NAME, DOB, COUNTRY, etc. Since OCR does not know the difference between a label and its actual value, it just dumps all the text. Now I have text in the DB in the following format:
Name: MyName Address: My Address
Now the next step is to extract values like MyName and My Address from the DB. The document types may vary, hence a generic parser might not work.
What would you suggest doing in this situation? Should I write different parsers? I am working in .NET.
Hello. This is a common question, and the OCR industry found a general solution to it years ago. That solution branches into two separate directions: using OCR for form processing, otherwise known as data extraction, can follow one of the following two methods.
TEXT PARSING – considered an old approach, but one that still works in many situations. Obviously you are experienced in it and know the pros and cons, so I will be brief here. The pro is that it requires no other technology, just generic programming. The cons are that a) it requires programming, b) it is not very adaptive to variations, c) if the formatting changes over time, you may have to rewrite some spaghetti or legacy code, and d) it requires a near-perfect OCR result in order to find data successfully (i.e. a mis-recognized label may result in missing data). In other words, it is great for quick and simple solutions, but not very adaptive to variations and changes. I did a lot of it back in my school and early programming days.
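To make the text parsing approach concrete, here is a minimal sketch in Python (the same idea carries over to .NET). The label list and the function name `parse_fields` are assumptions made for illustration; they do not come from any particular library.

```python
import re

# Hypothetical label set; every document type would need its own list.
LABELS = ["Name", "Address", "DOB", "Country"]

def parse_fields(text, labels=LABELS):
    """Split flat OCR text into {label: value} using exact label matches."""
    canonical = {label.lower(): label for label in labels}
    # Match any known label followed by a colon, ignoring case.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\s*:",
        re.IGNORECASE,
    )
    matches = list(pattern.finditer(text))
    fields = {}
    for i, match in enumerate(matches):
        # A value runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[canonical[match.group(1).lower()]] = text[match.end():end].strip()
    return fields

clean = parse_fields("Name: MyName Address: My Address")
# "Narne" simulates a common OCR confusion of "m" with "rn".
damaged = parse_fields("Narne: MyName Address: My Address")
```

With the clean input both fields are found. With the damaged input the Name field is silently lost, which is con d) above: an exact-match parser depends on the labels being recognized correctly.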
DYNAMIC DATA CAPTURE – using specialized technology to locate data dynamically. Some technologies do it on the image level and feed clean data to your database. Other technologies do it on the post-OCR text level. I am most familiar with data capture on the image level, as it has several key benefits for the complex projects I have done, so I will talk more about that. The only con is that you may need to invest in a specialized software tool, but that tool provides a lot of benefit. Even a plumber has to invest in tools to do his job. One benefit of image-based data extraction comes from the fact that post-OCR text is not always perfect: any text-based extractor has to accommodate recognition mistakes, something the old text parsing approach cannot do. Also, in text parsing you can use only text, while in image parsing you have a ton of other information, such as lines (like those between table columns), white gaps between blocks of text (such as paragraph separators), pictures, logos, checkboxes, etc.
For example, I heavily use ABBYY FlexiCapture for these types of extraction (https://www.wisetrend.com/abbyy_flexicapture.shtml). That tool allows me to define what data I need to extract and how it should be extracted. In your case, you would do something like this:
- Identify the format style, if more than one. If you have multiple formats, you can apply a different set of extraction rules per format.
- Locate the label “Name:” or some variation of it, using fuzzy search or rules to accommodate OCR mistakes, if any. Look only in a certain area if more than one name occurs on the page.
- Locate the area next to the found Name label that contains characters of a certain type. Those characters have to fit certain criteria to be accepted as the MyName field, and all those criteria are defined through the UI (or scripting if you want).
- OCR the content of the area with the MyName characters. Another benefit here is that you no longer use generic OCR. You can use very specific OCR settings that apply only to your MyName area, which increases the accuracy of the OCR and the data. This is most useful for specialized data, such as part numbers, codes, addresses, etc. You can use regular expressions, dictionaries, and rules, and you can be specific per field. That is not possible when full-page OCR is used.
- Send the clean data to the DB. Before you send the data, if you want to guarantee OCR quality, most tools have some kind of verification capability that lets a human visually check the OCRed text against the image.
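The label-location and per-field criteria steps above can be approximated on the post-OCR text level with a few lines of Python. To be clear, this is not FlexiCapture and does not use its API; it is only a sketch of the idea using the standard `difflib` module. The function names, the 0.6 similarity threshold, and the assumption that labels end with a colon are all hypothetical choices for illustration.

```python
import re
from difflib import SequenceMatcher

def find_label(tokens, label, threshold=0.6):
    """Return the index of the colon-terminated token closest to `label`, or None."""
    best_index, best_score = None, threshold
    for i, token in enumerate(tokens):
        if not token.endswith(":"):
            continue  # Assumption: only colon-terminated tokens can be labels.
        score = SequenceMatcher(None, token[:-1].lower(), label.lower()).ratio()
        if score >= best_score:
            best_index, best_score = i, score
    return best_index

def extract_field(text, label, value_pattern):
    """Fuzzy-locate `label`, then accept the text after it only if it fits the field rule."""
    tokens = text.split()
    start = find_label(tokens, label)
    if start is None:
        return None
    # The value runs until the next colon-terminated token (the next label).
    value_tokens = []
    for token in tokens[start + 1:]:
        if token.endswith(":"):
            break
        value_tokens.append(token)
    value = " ".join(value_tokens)
    # Per-field criteria: reject values that do not fit the expected shape.
    return value if re.fullmatch(value_pattern, value) else None

ocr_text = "Narne: MyName Address: My Address"
name = extract_field(ocr_text, "Name", r"[A-Za-z]+")
address = extract_field(ocr_text, "Address", r"[A-Za-z0-9 .,#-]+")
bad_dob = extract_field("DOB: MyName", "DOB", r"\d{2}/\d{2}/\d{4}")
```

Here the misread label "Narne:" is still matched to Name, and a value that does not look like a date is rejected instead of being stored. A real data capture tool applies the same kind of rules on the image, where it can also use position, lines, and field-specific OCR settings.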
In general, setting up these processes is much quicker and more liberating than code-based text parsing. There are plenty of scripting options and APIs available for those who want to go beyond the UI or need additional automation.
I have only scratched the surface, but hopefully this provides a start for your research and decision. If I have not addressed something, please feel free to let me know.
Ilya Evdokimov, Data Capture Expert for 10+ years, CDIA+ Certified
My blog with more data capture stuff is here: https://www.wisetrend.com/ocr_and_data_capture_blog/