In documents that contain tables, most of the information is usually printed in the table itself, so the demand to collect this data is very high. In data capture, organizations choose one of three scenarios for collecting data from these documents: ignore the table; capture the header, the footer, and just a portion of the table; or capture it all. Ideally organizations prefer the last option, but some strategic decisions have to be made prior to any integration that uses tables. One of those decisions is whether to capture the data in the table as a large body of individual fields or as a single table block. Let’s explore the benefits and downsides of both.
Why would you ever perform data capture of a table as a large collection of individual fields when you can collect it as a single table field? Accuracy. In theory, it will always be more accurate to collect every cell of a table as its own individual field. The reason is that each field is located precisely, which removes the risk of partially collected cells or cells where the baseline is cut, and keeps white space and lines out of the fields. In some data capture solutions this is your only choice. Because of this, many solutions have made it very easy to duplicate fields and make small changes, so that creating so many fields takes less time. This is a great tool, because the downside of treating a table as a collection of individual fields is the time it takes to create all of the fields, and that time may be too great to justify the increase in accuracy.
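To make the "duplicate fields and make small changes" idea concrete, here is a minimal Python sketch. It is not the API of any particular data capture product: the `Field` class, the `duplicate_down` helper, and all coordinates are hypothetical, and it assumes a table with fixed row height and fixed column positions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    """Hypothetical field definition: a named rectangle on the page, in pixels."""
    name: str
    left: int
    top: int
    width: int
    height: int


def duplicate_down(template, rows, row_height):
    """Copy one row of template fields downward, renaming and shifting each copy."""
    fields = []
    for row in range(rows):
        for f in template:
            fields.append(
                replace(f, name=f"{f.name}_{row + 1}", top=f.top + row * row_height)
            )
    return fields


# One hand-drawn row of cells (assumed layout), duplicated for 20 table rows.
template_row = [
    Field("qty", left=50, top=400, width=80, height=30),
    Field("description", left=130, top=400, width=400, height=30),
    Field("amount", left=530, top=400, width=120, height=30),
]
fields = duplicate_down(template_row, rows=20, row_height=30)
print(len(fields))  # 60 individual fields from 3 drawn by hand
```

Even with this kind of shortcut, every field is still tied to a fixed position, which is where both the accuracy and the setup cost of this approach come from.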
If your data capture application can collect data as a single table block, you can set up any one document type very quickly. Table blocks require document analysis that can identify table structures in a document. The table block relies heavily on the identified tables and then applies column names per the logic in your definition. This is what creates its simplicity, but also its problems. Sometimes document analysis finds tables incorrectly, and more often only partially. This can cause missing columns, missing rows, and, in the worst case, rows where the text is split vertically between two cells, or horizontal cuts that divide columns in half.
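Some of these failures can be caught with a simple structural check on whatever the table block returns. The following Python sketch assumes the detected table arrives as a list of rows with the header first; that shape, the `check_table` function, and the column names are assumptions for illustration, not the output format of any specific product.

```python
def check_table(rows, expected_columns):
    """Return a list of human-readable problems found in a detected table."""
    if not rows:
        return ["no rows detected"]
    problems = []
    header, body = rows[0], rows[1:]
    # Columns the definition expects but document analysis did not find.
    missing = [c for c in expected_columns if c not in header]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
    # A row with the wrong cell count suggests a split or merged cell.
    for i, row in enumerate(body, start=1):
        if len(row) != len(header):
            problems.append(f"row {i} has {len(row)} cells, expected {len(header)}")
    return problems


# Hypothetical detection result: one column was missed and one cell was split.
detected = [
    ["Qty", "Description", "Amount"],
    ["2", "Widget", "10.00"],
    ["1", "Gad", "get", "5.00"],
]
expected = ["Qty", "Description", "Unit Price", "Amount"]
problems = check_table(detected, expected)
for problem in problems:
    print(problem)
```

A check like this only flags missing columns and rows with an unexpected number of cells. It cannot tell that an entire row was missed, which would need something external to compare against, such as a line count or a total printed elsewhere on the document.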
Tables vary in complexity, and this is most often the deciding factor in which approach to take. Very often the accuracy required, and the amount of integration time needed to obtain that accuracy, also determine the approach. For organizations that want line items but do not require them, table blocks are ideal. For organizations that need high accuracy and process high volumes, individual fields are ideal. In any case, it’s something that needs to be decided prior to any integration work.
Chris Riley – Sr. Solutions Architect
Ilya Evdokimov is a long-term practitioner and expert in leading Optical Character Recognition (OCR), Data Capture and Document Processing techniques, technologies and solutions. With over 15 years of experience spanning enterprise software implementations, mobile application development, cloud-based systems integration and desktop-level automation, Ilya Evdokimov uses thorough industry knowledge and experience to achieve high efficiency and workflow optimization in the most challenging paper-dependent and digital image capture environments.