For all of you out there with theoretical and practical experience: what are the best hardware and server types for large-volume OCR conversion?  What types of processors and other resources have you found most effective, and why?  Please give a short description of your OCR work and loads to illustrate your feedback.  Thanks in advance.


Here is some early-stage feedback from two recent implementations of WiseINVOICE for AP data capture automation:

Todd Gruff, B & G Manufacturing Inc.’s lead for FlexiCapture integration, said the WiseTrend solution and services were top notch.  “We were very impressed with ABBYY FlexiCapture’s ability to capture the data we required.  The software is easy to use and, with the terrific support from WiseTrend, the software’s ability to be adapted to different variations of Purchase Orders is limitless.”  B & G Manufacturing Inc. is a world-renowned manufacturer and supplier of machined parts.  The company receives high volumes of Purchase Orders that require accurate and efficient processing, and the captured data is imported into SAP.

IBT, Inc. is a well-known wholesale industrial supplier interested in optimizing and streamlining Invoice processing and storage.  “After starting our data capture project with the industry’s standard approach, we switched to a more predictable approach using WiseTrend methodology.  After numerous tests this approach has proven to be the best way.  The single complex template would not produce reliable results on different variations.  Too much time was required for filling in missed fields and even double-checking all seemingly successful fields (aka false positives).  Using this new methodology we can control and rely on the data capture result.  Even when vendors are constantly changing their invoice format, which in the past would have caused serious complications and a major need for professional services, using WiseTrend’s method it is quick and efficient to make the change.”  Kevin Thompson and Randy Bledsoe are accountants at IBT, Inc., who now run and maintain their own full data capture system without any IT involvement.  Captured data and images are exported to DocuWare.  The project was led by Toshiba Business Solutions, a major copier integrator and ABBYY VAR.


We are proud to announce our new method of servicing invoice processing and AP data capture automation needs.  The WiseTrend WiseINVOICE implementation combines the best modern technologies, a proven methodology, and user-friendly training focused on user self-sufficiency.  This solution is part of a growing trend to eliminate runaway costs for professional services while obtaining a reliable AP automation system through a truly customizable, fully functional solution with an immediate and accurately calculable ROI.

Up until now, meticulous programming or the “one-size-fits-all” approach to Invoice processing were the only solutions in the industry.  This often misleading approach led customers to over-budget projects, skyrocketing professional service costs, long integration periods, frustration, and an inability to see a clear return on the technology investment.  As the quantity of document variations increases linearly, the complexity of the system and its associated setup increases exponentially.

With over 10 years of data capture experience, we developed a solution that works to optimize the technology and minimize production labor.  Our programmers previously spent weeks untangling projects that were based on the current industry lure: a one-size-fits-all solution for invoice processing.  Software manufacturers promise a single solution as a magic pill for all template variations.  Contrary to the promise, we spent many professional service days running regression testing on complex implementations for pre-setup AP projects.  Over many projects, we streamlined the Invoice and Purchase Order project setup to become a consistent and repeatable process.

Our approach provides tangibles that are surprisingly uncommon in the industry, including a predictable timeframe for achieving the customer’s return on investment and the removal of the often unpredictable runaway professional service costs and deadline extensions.  Following the proven formula, customers can accurately calculate the time and effort required to build an effective data extraction system for a particular quantity of variations.


I answered this question on StackOverflow, and it was too important not to duplicate here.

QUESTION
=================

I am extracting text from OCRed TIFF files using a library and dumping it into a database. The documents I am extracting text from are actually FORMS having fields like NAME, DOB, COUNTRY etc.   Since OCR does not know the difference between an actual value and its label, it just dumps all the text. Now I have text in the DB in the following format:

Name: MyName Address: My Address

Now the next step is to extract values like MyName and My Address from the DB. The document types may vary, hence a generic parser might not work.

What would you suggest doing in this situation? Should I write different parsers? I am working on .NET.

ANSWER
=================

Hello. This is a common question for which the OCR industry found a generic solution years ago, and the solution branches into two separate directions. Using OCR for form processing, otherwise known as data extraction, can follow one of the following two methods.

TEXT PARSING – considered an old approach that still works in many situations. Obviously you are experienced in that and know the pros and cons, so I will be brief here. The pro is that it requires no other technology, just generic programming. The cons are that a) it requires programming, b) it is not very adaptive to variations, c) if formatting changes over time you may have to rewrite some spaghetti or legacy code, and d) it requires a near-perfect OCR result in order to find data successfully (i.e. a mis-recognized label may result in missing data). In other words, it is great for quick and simple solutions, but not too adaptive to variations and changes. I have done it a lot back in my school and early programming days.
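To make the text-parsing option concrete, here is a minimal sketch in Python (the question targets .NET, but the idea carries over directly). The label list and the `parse_fields` helper are my own assumptions for illustration and are not part of any OCR library; a real form would need its own labels.

```python
import re

# Hypothetical label set, assumed for this example only.
LABELS = ["Name", "Address", "DOB", "Country"]

def parse_fields(text, labels=LABELS):
    """Split 'Label: value Label: value' text into a dict keyed by label."""
    # Match any known label (as a whole word) followed by a colon.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, labels)) + r")\s*:\s*")
    matches = list(pattern.finditer(text))
    fields = {}
    for i, m in enumerate(matches):
        # A value runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[m.group(1)] = text[m.end():end].strip()
    return fields

print(parse_fields("Name: MyName Address: My Address"))
# A single OCR mistake in the label ("Narne:") makes the field disappear.
print(parse_fields("Narne: MyName"))
```

The second call shows con (d) in practice: the parser only finds what it can match exactly, so a mis-recognized label silently drops the value.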

DYNAMIC DATA CAPTURE – using some special technology to dynamically locate data. Some technologies do it on the image level and feed clean data to your database. Other technologies do it on the post-OCR text level. I am most familiar with data capture on the image level, as it has several key benefits for complex projects I have done, so I will talk more about that. The only con is that you may need to invest in a specialized software tool, but that is a tool that provides a lot of benefit. Even a plumber has to invest in tools to do his job. The benefit of image-based data extraction is that post-OCR text is not always perfect, so a text-based extractor has to accommodate mistakes, something that an old text parsing approach cannot do. Also, in text parsing you can use only text, while in image parsing you have a ton of other information, such as lines (like in table columns), white gaps between texts (such as paragraph separators), pictures, logos, checkboxes, etc.

For example, I heavily use ABBYY FlexiCapture for these types of extraction (http://www.wisetrend.com/abbyy_flexicapture.shtml). That tool allows me to define what data I need to extract and how it should be extracted. For example, you would do something like this:

  1. Identify the format style, if more than one. If you have multiple formats, you can apply a different set of extraction rules per format.
  2. Locate the label “Name:” or some other variation of it using fuzzy search or rules to accommodate OCR mistakes, if any. Look in a certain area if more than one name occurs on the page.
  3. Locate the area that contains chars of certain type next to the found label Name. Those chars have to fit certain criteria to be accepted as MyName field, and all those criteria are defined through UI (or scripting if you want).
  4. OCR the area content with MyName chars. Another benefit here is that you no longer use a generic OCR. You can use very specific OCR settings that apply only to your MyName area – which increases the accuracy of OCR and data. This is most useful for specialized data, such as part numbers, codes, addresses, etc. You can use regular expressions, dictionaries, rules. You can be specific per field. That is not possible when full-page OCR is used.
  5. Send the clean data to DB. Before you send the data, if you want to guarantee OCR quality, most tools usually have some kind of Verification capability to visually check (requires a human) OCRed text against the image.
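Step 2 above (fuzzy label search) can be sketched in a few lines. This is not FlexiCapture's API; it is a stand-in that works on post-OCR words using Python's standard `difflib`, and the helper names and the 0.75 threshold are assumptions chosen for this example. In a real project the threshold would need tuning against your own documents.

```python
from difflib import SequenceMatcher

def find_label(words, label="Name:", threshold=0.75):
    """Return the index of the word most similar to `label`, or None."""
    best_index, best_score = None, threshold
    for i, word in enumerate(words):
        # ratio() is 1.0 for identical strings and drops as they diverge.
        score = SequenceMatcher(None, word.lower(), label.lower()).ratio()
        if score >= best_score:
            best_index, best_score = i, score
    return best_index

def value_after_label(words, label="Name:"):
    """Return the word that follows the fuzzily matched label, if any."""
    i = find_label(words, label)
    if i is None or i + 1 >= len(words):
        return None
    return words[i + 1]

# "Nane:" simulates an OCR mistake in the label.
ocr_words = ["Nane:", "MyName", "Address:", "My", "Address"]
print(value_after_label(ocr_words))
```

Unlike exact text parsing, the misread label is still found, which is the whole point of tolerating OCR mistakes at the locating step.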

In general, setting up these processes is much quicker and more liberating than code-based text parsing. There is plenty of scripting and APIs available for those who want to go past UI or need additional automation.

I have only scratched the surface, but hopefully that provides a start for your research and decision. If I have not addressed anything, please feel free to let me know.

Ilya Evdokimov, Data Capture Expert for 10+ years, CDIA+ Certified

My blog with more data capture stuff is here: http://wisetrend.com/ocr_and_data_capture_blog/


For years we have been polishing one of the most demanded and demanding areas of data capture and OCR – AP department automation.  Processing and automation of numerous variations of different Invoices, Purchase Orders and Agreements is still one of the data capture industry’s larger challenges, but we are proud to offer our proven solution for this task.

About our Invoice and Purchase Order Data Capture and Processing Approach: Invoices are considered some of the more complex documents.  Luckily, the technology is capable enough today that tedious text parsing is no longer necessary, and there is a set of proven methods.  Over the years we have gone through numerous projects and revised our project setup method many times, and today I believe we have the best balance between required effort and achieved capability, using the latest software features.  We bypass the single-template approach, which in the past proved to be an unpredictable trap of professional services.  Today we have a repeatable and easily quantifiable method: after the initial implementation we can estimate exactly what further professional services, if any, will be needed.  Through a special hands-on training process we hand the continuation of the setup over to the client, giving them control and strengthening their in-house capabilities.  In fact, the last project was run by accountants trained on FlexiCapture template creation, not IT.  Please watch for a press release on this subject in the next few days.

This process has worked well for all participants in the recent past, and we plan to continue polishing it in the future.


Admiring the new iPhone 4G with its new and improved camera and high-definition, crystal-clear screen, I immediately came up with dozens of ideas for what I could do with it. As the quality of hardware improves, what used to be negligible becomes more and more pronounced.

Think about this – 20 years ago a photograph was a photograph and no one would question those pesky pixels. With the birth of computers, digital picture viewing, and digital picture taking, picture quality became one of the most important concerns for many. As the technology improves, it only encourages an infinite race towards perfection.

Today, as the screen of the iPhone 4G improves the user’s perception of the picture, shadowy gray pictures no longer cut it.

iPhone 3Gs picture of a random text for OCR

Instead, we desire crispness, high-quality contrast, and most importantly appeal to our ultimate judge – the eye. A simple submission to an online OCR system through e-mail or API can return the same image within seconds – but in a different light. The image could be deskewed (lines straightened), despeckled (pixel noise removed), and binarized (all colors reduced to black and white). Obviously this is not right for pictures of people and buildings, but it does wonders on text documents, business cards, and signs.
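For the curious, here is a toy sketch of two of those cleanup steps, binarization and despeckling, in plain Python. It is only an illustration under simple assumptions (a grayscale image as a list of rows of 0-255 values, a crude mean-brightness threshold, and "speckle" meaning a black pixel with no black neighbours); real OCR engines use far more sophisticated methods, and deskewing is left out entirely.

```python
def binarize(gray, threshold=None):
    """Map grayscale pixels (0-255) to 0 (black) or 255 (white)."""
    if threshold is None:
        # Assumption: the mean brightness serves as a crude global threshold.
        pixels = [p for row in gray for p in row]
        threshold = sum(pixels) / len(pixels)
    return [[0 if p < threshold else 255 for p in row] for row in gray]

def despeckle(binary):
    """Turn black pixels with no black 4-neighbours (isolated noise) white."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 0:
                continue
            neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            has_black_neighbour = any(
                0 <= ny < h and 0 <= nx < w and binary[ny][nx] == 0
                for ny, nx in neighbours
            )
            if not has_black_neighbour:
                out[y][x] = 255
    return out

# Tiny made-up "photo": a dark stroke in row 1 and one stray dark pixel.
gray = [
    [210, 200, 205, 40],
    [50, 60, 215, 220],
    [205, 210, 200, 215],
]
for row in despeckle(binarize(gray)):
    print(row)
```

The two-pixel stroke survives while the stray dark pixel in the corner is cleaned away.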

Image after being cleaned up through an Online OCR engine

Now one can fully utilize the new sharp screen on the iPhone 4G to view these types of images. Of course, this benefit is useful in those cases where looking at images is desired. Otherwise, I would take it one step further and view the actual OCR result for truly digital, sharpest-possible text.

Result from OCR conversion in an MS Word document


This month we are starting a Summer 2010 series of Tips & Tricks related to the OCR and form processing industry in general, with a touch of mobile image processing, 3rd party tools and utilities, and best practices.  For the past 10+ years I have been building projects for a wide range of companies and have acquired a unique perspective on what works and what does not, even if it sounds great on paper.  Having used mostly ABBYY OCR for these implementations, I plan to cover the ABBYY Recognition Server and ABBYY FlexiCapture product lines, but most of these generic approaches and tricks should work for all other OCR and Data Capture systems out there.  Stay tuned for new information in this series.

Happy OCR-ing,
Ilya Evdokimov, CDIA+


When I talk to people about the unique technique of printing text documents to image just for the purpose of running optical character recognition (OCR) or data capture on them, they are rightfully confused and think I’m a little nutz.

Why would you ever convert an already digital document back to image? I promise it’s not because I’m so fond of OCR; it actually has its purpose.

Language Detection: By converting a document to image for OCR, I can check the language of each word in the document. While I would much prefer to use a language detection tool on a digital file, no robust tool exists to do this at volume. The unique aspect of OCR engines is that they contain morphology and dictionaries. This is where OCR has improved its accuracy in the past 5 years. OCR engines attempt to identify the language of text in order to better read the document. Because this mechanism is already built into the engine, if I convert a digital file to image and OCR it, I can tell you what languages exist in that document. Additionally, while font is a clear indicator of language, if it is not accompanied by the proper language encoding, it will not tell the digital process what the language is, and in OCR there is no need for such an encoding.

Normalization of digital formats: While a PDF created in Acrobat and a PDF created in a third-party tool look identical to the viewer, internally these PDF files are very different. In order to accurately parse a PDF file digitally, you have to have a standard format that is used. If you do not have a standard format, you are dealing with variations both in the document’s visual layout and in its internal structure. This becomes an overwhelming number of variations. For example, a collection of invoices has as many variations as there are invoice layouts multiplied by the number of PDF-generating applications in use. However, if you were to OCR the PDF and parse the result, instead of parsing digitally, then you are dealing with only the number of variants that exist in the invoices themselves.

However crazy it sounds, the above two are real scenarios and there are many more. I doubt that these problems will always exist, but it makes you think twice about crazy statements such as printing a digital document to image just so you can OCR it.

Chris Riley – Industry Expert


In some organizations, document preparation prior to scanning is the largest time cost in their document entry process. In all organizations, it’s an important consideration. Document preparation is the process of sorting, organizing, and preparing documents for the most successful document scan and the best chance at accuracy in downstream software processes. Document preparation can be as simple as dividing pages into stacks small enough for a document scanner to handle, or as complex as staple removal, envelope opening, and document separation using page separators.

As recognition technology advances, the need for document preparation diminishes. New technologies are allowing for automatic document separation based on templates or keywords, automatic document rotation, annotation, sorting, etc. The challenge for organizations becomes deciding which document preparation steps to hand to technology and which to leave to manual labor. This has been a challenging question and as new technologies surface, it becomes even more challenging.

If an organization keeps its focus on return on investment, the path should become clear. Complete evaluation of the technologies will show the accuracy and % of automation that can be accomplished with technology, and the amount of time and cost it will save. The tricky part of the evaluation is really in understanding the environment. Doing a study of how document preparation is currently done, and of all the document preparation required for document entry, should be fairly straightforward. Listing the features of document preparation that can be handled by software, and the products that have them, is a little more complex and requires an organization to spend dedicated time on it. The process of separating documents and barcoding documents tends to be the biggest cost and the low-hanging fruit for automation. OCR software can determine where a document starts and ends using keywords, instead of a person manually placing separator pages or barcodes on the document.
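As a rough illustration of keyword-based separation, here is a minimal sketch in Python. It assumes each scanned page has already been OCRed into a string; the `split_documents` name and the keyword list are hypothetical, not taken from any product.

```python
def split_documents(pages, start_keywords=("INVOICE", "PURCHASE ORDER")):
    """Group consecutive pages into documents; a keyword hit starts a new one."""
    documents = []
    for text in pages:
        # Case-insensitive check for any keyword that marks a first page.
        starts_new = any(keyword in text.upper() for keyword in start_keywords)
        if starts_new or not documents:
            documents.append([text])
        else:
            documents[-1].append(text)
    return documents

pages = [
    "Invoice #1001  Bill to: ...",
    "page 2 of items",
    "Purchase Order 77  Ship to: ...",
    "Invoice #1002  Bill to: ...",
]
for doc in split_documents(pages):
    print(len(doc), "page(s):", doc[0][:20])
```

A sketch this simple will split wrongly if a keyword also appears on a continuation page, which is one reason evaluation on your own documents matters before replacing separator sheets.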

For most organizations the result is a combination of manual and automatic. The ultimate goal would be to automate every step in document preparation that can be automated and leave those that have to be manual such as placing documents in a scanner.

Chris Riley – Industry Expert


The two most common questions organizations ask when they are seeking document automation technology are “how fast is it?” and “how accurate is it?”. Many don’t realize that the two are in opposition to each other most of the time. The more accurate a system, the slower it is, and the faster it is, the less accurate. But there is one fatal mistake in all these calculations, and that mistake is how efficiency is calculated.

Most companies that trial data capture calculate performance on the slowest step, which is optical character recognition (OCR). Literally, companies will hit the “read” button and immediately start timing until the read is complete. This is what is considered the speed of the document automation system. This is incorrect.

There is no question that OCR can be a tremendous bottleneck in the entire entry process, but poor OCR could create an even greater bottleneck. Imagine an OCR engine that reads a document with 100 characters in 1 second as compared to an engine that reads the same 100 characters in 3 seconds. Your initial thought is that the first engine would be better, but consider that the first engine may be 60% accurate, leaving 40 characters to be manually entered, and the other engine 98% accurate, leaving 2 characters to be manually entered or corrected. If you consider an average entry speed of 1.6 characters per second, then it will take an additional 25 seconds to enter the 40 characters, for a total entry time of 26 seconds for the faster engine. For the slower engine it will take an additional 1.25 seconds to enter or edit the 2 wrong characters, for a total entry time of 4.25 seconds. This means that end-to-end, the slower engine is about 6 times faster in the document automation process than the faster engine.
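The arithmetic above is easy to reproduce and to rerun with your own numbers. The sketch below is in Python; the `total_seconds` helper is my own name for illustration, and the 1.6 characters per second entry speed is the figure assumed in the example above.

```python
def total_seconds(chars, ocr_seconds, accuracy_pct, entry_rate=1.6):
    """OCR time plus the time to key in the characters the engine got wrong."""
    wrong_chars = chars * (100 - accuracy_pct) / 100
    return ocr_seconds + wrong_chars / entry_rate

fast_engine = total_seconds(100, ocr_seconds=1, accuracy_pct=60)  # 1 + 40 / 1.6
slow_engine = total_seconds(100, ocr_seconds=3, accuracy_pct=98)  # 3 + 2 / 1.6

print(f"fast engine: {fast_engine:.2f} s end-to-end")
print(f"slow engine: {slow_engine:.2f} s end-to-end")
print(f"ratio: {fast_engine / slow_engine:.1f}x")
```

Plugging in measured accuracy and read times from a trial gives a much fairer comparison than timing the “read” button alone.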

This simple calculation illustrates the folly in assuming that the slower OCR time makes for a slower overall process. Usually focusing on accuracy has the greatest benefit for an organization unless you are improving the speed of a slower engine with hardware, or two engines are too close to see a benefit.

Chris Riley – Industry Expert
