I find myself educating even industry peers about document structure more and more often. The conversation usually starts with one of them telling me that unstructured document processing exists, or insisting that a particular form is fixed when it is not. Understanding what is actually meant by document structure is very important.
First, let's define a document. A document is a collection of one or more pages that has a business process associated with it. Documents of a single type can vary in length, but the content they contain, or may contain, is constrained. Data capture technology works on pages, so each page of a document is processed as a separate entity, and this, it seems, is the heart of the confusion.
Often someone will say a document is unstructured. What they usually mean is that the order of its pages is unstructured. That is more or less accurate; however, the pages within this unstructured document are themselves either fixed or semi-structured. The only truly unstructured documents that exist are contracts and agreements. The test is simple: if you can pull any page from the document and state what that page is and what information it should contain, then it IS NOT unstructured.
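The page test above can be written down as a small check. This is only a sketch of the rule as stated here, not how any capture product implements it; the `Page` class and its `page_type` and `expected_fields` attributes are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    # Hypothetical: the name a reviewer can give the page, or None if they cannot name it
    page_type: Optional[str] = None
    # Hypothetical: the information the reviewer expects that kind of page to carry
    expected_fields: List[str] = field(default_factory=list)


def page_is_unstructured(page: Page) -> bool:
    """A page is unstructured only if we cannot say what it is or what data it holds."""
    return page.page_type is None or not page.expected_fields


# A page we can name and describe passes the test, so it is NOT unstructured
invoice_page = Page(page_type="invoice", expected_fields=["vendor", "total"])
# A page we can say nothing about in advance fails the test
unknown_page = Page()

print(page_is_unstructured(invoice_page))
print(page_is_unstructured(unknown_page))
```

Passing this test only says a page is not unstructured; whether it is fixed or semi-structured is a separate question.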
The ability to process agreements and contracts is very limited and applies only in very concrete scenarios where contract variants are essentially non-existent, which in effect means those contracts are no longer unstructured. In general, the ability to process unstructured documents does not exist. Now let's explore the difference between semi-structured and fixed.
It’s actually very easy, because 80% of the documents that exist are semi-structured. Even if a field appears in the same general location on every page of a particular type, that does not make the page fixed. For example, a tax form always has the same general location for the company name. The printer has to print within a specified range, but it can print further to the left or closer to the top, and the length will vary with every name entered. This makes the form semi-structured. In addition, when the document is scanned, the image will shift left, right, up, or down by small amounts. A document is ONLY truly a fixed form when it has registration marks and fields of fixed location and length. Registration marks are how the software maps every image onto the same set of coordinates, making it more or less identical to the template.
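To make the role of registration marks concrete, here is a minimal sketch of mapping a template field onto a shifted scan. It assumes a single registration mark and a pure left/right/up/down shift with no rotation or scaling; the `Point` and `Box` types, the function names, and all coordinates are invented for illustration and do not come from any particular capture product.

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: int
    y: int


class Box(NamedTuple):
    left: int
    top: int
    width: int
    height: int


def registration_offset(template_mark: Point, scanned_mark: Point) -> Point:
    """How far the scanned image has shifted relative to the template."""
    return Point(scanned_mark.x - template_mark.x, scanned_mark.y - template_mark.y)


def locate_field(template_box: Box, offset: Point) -> Box:
    """Move a fixed-size template field by the measured shift."""
    return Box(
        template_box.left + offset.x,
        template_box.top + offset.y,
        template_box.width,   # a fixed form keeps field length constant
        template_box.height,
    )


# Illustrative numbers only: the scan landed 7 pixels right and 4 pixels up
template_mark = Point(50, 50)
scanned_mark = Point(57, 46)
company_name = Box(left=200, top=120, width=400, height=30)

offset = registration_offset(template_mark, scanned_mark)
print(locate_field(company_name, offset))
# prints Box(left=207, top=116, width=400, height=30)
```

This only works because the field has a fixed location and length on the template. On a semi-structured page there is no single box to shift, because the printed value itself moves and changes length from one document to the next.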
Here again the confusion is exposed. When having conversations about data capture, it is very important to understand the true definitions of the terms being used. I task you: if you catch someone using the lingo incorrectly, correct it. It will help both you and them.
Chris Riley – Sr. Solutions Architect
Ilya Evdokimov is a long-term practitioner and expert in leading Optical Character Recognition (OCR), Data Capture and Document Processing techniques, technologies and solutions. With over 15 years of experience spanning enterprise software implementations, mobile application development, cloud-based systems integration and desktop-level automation, Ilya Evdokimov uses thorough industry knowledge and experience to achieve high efficiency and workflow optimization in the most challenging paper-dependent and digital image capture environments.