For years we have been polishing one of the most in-demand and demanding areas of data capture and OCR: AP department automation.  Processing the many variations of invoices, purchase orders and agreements is still one of the data capture industry’s larger challenges, but we are proud to offer our proven solution for this task.

About our Invoice and Purchase Order Data Capture and Processing Approach: Invoices are considered some of the more complex documents.  Luckily, today’s technology is capable enough that tedious text parsing is no longer necessary, and there is a set of proven methods.  Over the years we have gone through numerous projects and revised our setup method many times, and today I believe we have the best balance between required effort and achieved capability, made possible by the latest software features.  We bypass the single template approach, which in the past proved to be an unpredictable trap of professional services.  Today we have a repeatable and easily quantifiable method: after the initial implementation we can accurately estimate any further need for professional services.  Through a special hands-on training process we hand the continuation of the setup over to the client, giving them control and strengthening their in-house capabilities.  In fact, the last project was run by accountants trained on FlexiCapture template creation, not IT.  Please watch for a press release on this subject in the next few days.

This process has worked well for all participants in the recent past, and we plan to keep polishing it in the future.


Admiring the new iPhone 4G with its new and improved camera and high-definition, crystal-clear screen, I immediately came up with dozens of ideas for what I could do with it. As the quality of hardware improves, what used to be negligible becomes more and more pronounced.

Think about this – 20 years ago a photograph was a photograph and no one would question those pesky pixels. With the birth of computers, digital picture viewing, and digital picture taking, picture quality became one of the most important concerns for many. As the technology improves, it only encourages an infinite race towards perfection.

Today, as the screen of the iPhone 4G improves the user’s perception of the picture, shadowy gray pictures no longer cut it.

iPhone 3Gs picture of a random text for OCR

Instead, we desire crispness, high contrast, and most importantly appeal to our ultimate judge: the eye. A simple submission to an online OCR system through e-mail or an API can return the same image within seconds, but in a different light. The image could be deskewed (lines straightened), despeckled (pixel noise removed), and binarized (all colors reduced to black and white). Obviously this is not right for pictures of people and buildings, but it does wonders on text documents, business cards, and signs.

Image after being cleaned up through an Online OCR engine
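To make two of the clean-up steps mentioned above a little more concrete, here is a minimal Python sketch of binarization and despeckling on a tiny grayscale image stored as a list of rows. The fixed threshold of 128, the "no black neighbor" rule, and the function names are assumptions chosen for illustration; a real online OCR engine almost certainly uses more sophisticated methods, and deskewing is left out entirely.

```python
def binarize(gray, threshold=128):
    """Map every grayscale pixel (0-255) to pure black (0) or pure white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


def despeckle(binary):
    """Turn isolated black pixels white.

    A black pixel counts as noise when none of its 8 neighbors is black.
    (Hypothetical rule for illustration only.)
    """
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 0:
                continue
            has_black_neighbor = any(
                binary[ny][nx] == 0
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbor:
                cleaned[y][x] = 255
    return cleaned


# A gray "photo": a dark vertical stroke plus one dark speck in the corner.
photo = [
    [200, 40, 210, 220],
    [190, 35, 205, 215],
    [180, 200, 210, 30],
]
clean = despeckle(binarize(photo))
for row in clean:
    print(row)
```

The stroke survives because its pixels touch each other, while the lone speck in the corner is removed.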

Now one can fully utilize the sharp new screen on the iPhone 4G to view these types of images. Of course, this benefit applies in those cases where looking at images is desired. Otherwise, I would take it one step further and view the actual OCR result, which is true digital text and as sharp as it can possibly be.

Result from OCR conversion in MS Word document


This month we are starting a Summer 2010 series of Tips & Tricks related to the OCR and form processing industry in general, with a touch of mobile image processing, 3rd party tools and utilities, and best practices.  For the past 10+ years I have been building projects for a wide range of companies and have acquired a unique perspective on what works and what does not, even when it sounds great on paper.  Having used mostly ABBYY OCR for these implementations, I plan to cover the ABBYY Recognition Server and ABBYY FlexiCapture product lines, but most of these generic approaches and tricks should work for other OCR and Data Capture systems as well.  Stay tuned for new information in this series.

Happy OCR-ing,
Ilya Evdokimov, CDIA+


When I talk to people about the unique technique of printing text documents to image just for the purpose of running optical character recognition (OCR) or data capture on them, they are rightfully confused and think I’m a little nutz.

Why would you ever convert an already digital document back to image? I promise it’s not because I’m so fond of OCR; it actually has its purpose.

Language Detection: By converting a document to image for OCR, I can check the language of each word in the document. While I would much prefer to use a language detection tool on a digital file, no robust tool exists to do this at volume. The unique aspect of OCR engines is that they contain morphology and dictionaries. This is where OCR has improved its accuracy in the past 5 years. OCR engines attempt to identify the language of text in order to better read the document. Because this mechanism is already built into the engine, if I convert a digital file to image and OCR it, I can tell you what languages exist in that document. Additionally, while font is a clear indicator of language, if it is not accompanied by the proper language encoding, it will not tell the digital process what the language is, and in OCR there is no need for such an encoding.

Normalization of digital formats: While a PDF created in Acrobat and a PDF created in a third party tool look identical to the viewer, internally these PDF files are very different. In order to accurately parse a PDF file digitally, you have to have a standard format that is used. If you do not have a standard format, you are dealing with variations both in the document’s visual layout and in its internal structure. This becomes an overwhelming number of variations. For example, a collection of invoices has as many variations as there are invoice layouts multiplied by the number of PDF generating applications. However, if you were to OCR the PDF and parse the result, instead of parsing it digitally, you are dealing only with the number of variants that exist in the invoices themselves.

However crazy it sounds, the above two are real scenarios and there are many more. I doubt that these problems will always exist, but it makes you think twice about crazy statements such as printing a digital document to image just so you can OCR it.

Chris Riley – Industry Expert


In some organizations, document preparation prior to scanning is the largest time cost in their document entry process. In all organizations, it’s an important consideration. Document preparation is the process of sorting, organizing, and preparing documents to give the best chance of a successful scan and of accuracy in downstream software processes. Document preparation ranges from as simple as dividing pages into stacks small enough for a document scanner to handle, to as complex as staple removal, envelope opening, and document separation using page separators.

As recognition technology advances, the need for document preparation diminishes. New technologies are allowing for automatic document separation based on templates or keywords, automatic document rotation, annotation, sorting, etc. The challenge for organizations becomes picking what document preparation step to use technology on versus manual labor. This has been a challenging question and as new technologies surface, it becomes even more challenging.

If an organization keeps its focus on return on investment, the path should become clear. A complete evaluation of the technologies will show the accuracy and percentage of automation that can be accomplished with technology, and the amount of time and cost it will save. The tricky part of the evaluation is really in understanding the environment. Doing a study of how document preparation is currently done, and of all document preparation steps required for document entry, should be fairly straightforward. Listing the document preparation features that can be handled by software, and the products that have them, is a little more complex and requires an organization to spend dedicated time on it. Separating and barcoding documents tends to be the biggest cost and the low hanging fruit for automation. OCR software can determine where a document starts and ends using keywords, instead of a person manually placing separator pages or barcodes on the document.
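As a rough illustration of keyword-based separation, the Python sketch below groups already-OCR'd page texts into documents, opening a new document whenever a page contains one of a set of start keywords. The function name, the keyword list, and the sample pages are hypothetical, and a production system would typically also use layout templates and end-of-document cues.

```python
def split_by_keyword(pages, start_keywords):
    """Group OCR'd page texts into documents.

    A page containing any start keyword (case-insensitive) opens a new
    document; every other page is appended to the current document.
    """
    documents = []
    for text in pages:
        lowered = text.lower()
        starts_new = any(keyword.lower() in lowered for keyword in start_keywords)
        if starts_new or not documents:
            documents.append([text])
        else:
            documents[-1].append(text)
    return documents


# Hypothetical scanned stack: five pages, three documents, no separator sheets.
stack = [
    "INVOICE No. 1001 Acme Corp",
    "continued line items",
    "Invoice No. 1002 Globex",
    "Purchase Order 77",
    "terms and conditions",
]
documents = split_by_keyword(stack, ["invoice no", "purchase order"])
print(len(documents))  # 3
```

Pages without a keyword simply attach to the document before them, which is the job a separator sheet would otherwise do by hand.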

For most organizations the result is a combination of manual and automatic. The ultimate goal would be to automate every step in document preparation that can be automated and leave those that have to be manual such as placing documents in a scanner.

Chris Riley – Industry Expert


The two most common questions organizations ask when they are seeking document automation technology are “how fast is it?” and “how accurate is it?”. Many don’t realize that the two are in opposition to each other most of the time. The more accurate a system, the slower it is, and the faster it is, the less accurate. But there is one fatal mistake in all these calculations, and that mistake is how efficiency is calculated.

Most companies that trial data capture calculate performance on the slowest step, which is optical character recognition (OCR). Literally, companies will hit the “read” button and immediately start timing until the read is complete. This is what is considered the speed of the document automation system. This is incorrect.

There is no question that OCR can be a tremendous bottleneck in the entire entry process, but poor OCR can create an even greater one. Imagine an OCR engine that reads a document with 100 characters in 1 second, compared to an engine that reads the same 100 characters in 3 seconds. Your initial thought is that the first engine would be better, but consider that the first engine may be 60% accurate, leaving 40 characters to be manually entered, and the other engine 98% accurate, leaving 2 characters to be manually entered or corrected. If you assume an average entry speed of 1.6 characters per second, the 40 characters take an additional 25 seconds to enter, for a total entry time of 26 seconds for the faster engine. For the slower engine it takes an additional 1.25 seconds to enter or edit the 2 wrong characters, for a total entry time of 4.25 seconds. This means that end-to-end, the slower engine is about 6 times faster in the document automation process than the faster engine.
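The arithmetic in this example can be reproduced with a few lines of Python. The function name and the simple model (OCR time plus manual keying time for every wrong character) are assumptions made for illustration; the numbers are the ones used in the paragraph above.

```python
def total_entry_seconds(ocr_seconds, characters, accuracy, keying_rate=1.6):
    """End-to-end time: OCR time plus time to key in the characters OCR got wrong.

    keying_rate is in characters per second (1.6 is the figure from the text).
    """
    wrong_characters = round(characters * (1 - accuracy))
    return ocr_seconds + wrong_characters / keying_rate


fast_engine = total_entry_seconds(ocr_seconds=1, characters=100, accuracy=0.60)
slow_engine = total_entry_seconds(ocr_seconds=3, characters=100, accuracy=0.98)

print(fast_engine)                # total seconds for the 1-second, 60% engine
print(slow_engine)                # total seconds for the 3-second, 98% engine
print(fast_engine / slow_engine)  # how many times quicker the "slow" engine is overall
```

Changing the accuracy or keying rate shows how quickly manual correction time dominates the raw OCR time.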

This simple calculation illustrates the folly in assuming that the slower OCR time makes for a slower overall process. Usually focusing on accuracy has the greatest benefit for an organization unless you are improving the speed of a slower engine with hardware, or two engines are too close to see a benefit.

Chris Riley – Industry Expert


Users of OCR might be surprised to learn that one of the initial and biggest drivers for the technology has yet to be fully actualized. Soon after Ray Kurzweil’s development of omni-font optical character recognition, it was believed that the greatest use of the technology would be in assisting language translation. Even Kurzweil himself very quickly used OCR technology simply to convert scanned images to text so that it could be read digitally for the blind. Some of the developers of OCR technology did not even start with any specialty in imaging but actually specialized in language and dictionary software.

The relationship of OCR technology to language is very interesting and several levels deep. For example, the modern engines show greatest improvements in accuracy by deploying more statistical language models and dictionaries vs. core recognition algorithms. In this method, language is improving the accuracy of OCR technology. For example the letter “e” in English is more frequent than the letter “c”, so in the case where there is a question between an “e” and “c”, this information is useful.
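A minimal Python sketch of this idea follows: when the recognizer is unsure between two shapes, multiply its confidence for each candidate by how common that letter is in the language and keep the best score. The confidence values are hypothetical, the letter frequencies are approximate figures for English, and real engines rely on much richer models such as dictionaries and word-level statistics.

```python
# Approximate relative frequencies of two letters in English text.
ENGLISH_LETTER_FREQUENCY = {"e": 0.127, "c": 0.028}


def pick_character(candidates, priors=ENGLISH_LETTER_FREQUENCY):
    """Choose the candidate with the highest (recognizer confidence x language prior).

    candidates maps each possible character to the recognizer's confidence.
    Characters missing from the priors table get a prior of 0.
    """
    return max(candidates, key=lambda ch: candidates[ch] * priors.get(ch, 0.0))


# The shape alone is a coin toss, so the language statistics decide.
print(pick_character({"e": 0.5, "c": 0.5}))  # e

# A strong enough visual signal still overrides the prior.
print(pick_character({"e": 0.1, "c": 0.9}))  # c
```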

But the most sought after initial use of OCR was simply to get digital text in order to convert it to another language. The dream was to enable travelers to take pictures of foreign signs or documents and have them converted on the fly to their native language. While this was one of the biggest drivers for the further development of OCR, the roadblocks of photography, accurate language translation, and the poor processing power of mobile devices were overlooked. Because of this, the use of OCR primarily became document automation and a means to reduce the cost of data entry. This focus changed the way the engines were developed, with the new focus being document OCR and not photographic OCR.

I’m confident that the dream will eventually be actualized but I also suspect that many changes to the way OCR engines operate, and the appearance of new specialized engines will happen first.

Chris Riley – Industry Expert


The search for greater accuracy in document automation never stops. It’s true that OCR technology has become so advanced that the jumps in accuracy with each new release are not what they were 10 years ago. Now, new versions of OCR engines contain enhancements for low quality documents and vertical document types, but general OCR can’t get much better. Because of this, modern integrations need to find new tricks. This blog is full of them, but I’m about to explain just one more: OCRing inverted text.

OCRing inverted text is nothing new. Many document types have regions where white text is printed on a black background. The modern engines have an ability to read this text. Typically it’s not as accurate as OCR of black text on a white background, but it has its unique benefits, especially with complex document types such as EOBs and driver’s licenses.

There is a trick in using inverted text OCR to increase overall OCR accuracy. The method is to first OCR a document normally, then use imaging technology to invert the image. When you invert the image, the black text on white background switches to white text on a black background. Once the inversion is done, run OCR again. By comparing the two OCR results, you have essentially run a vote with a single engine, with little effort.

Large volume processing environments can deploy this trick without loading a second OCR engine or applying different settings. It’s important to note that when using this technique, how you compare the two results is as important as the process itself. Typically you will assign more weight to the original version of the document than to the inverted one. There you have it, one more tool for increasing the OCR accuracy of the engine you already use.
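The inversion and the weighted comparison can be sketched in Python as follows. This is only an illustration under several assumptions: the OCR results are modeled as lists of (character, confidence) pairs that are already aligned position by position, the 1.0 and 0.8 weights are arbitrary, and the function names are hypothetical. The OCR call itself is not shown because it depends on the engine in use.

```python
def invert(gray):
    """Invert a grayscale image (list of rows of 0-255 values): black becomes white."""
    return [[255 - px for px in row] for row in gray]


def vote(original, inverted, original_weight=1.0, inverted_weight=0.8):
    """Merge two aligned OCR results, favoring the pass over the original image.

    Each result is a list of (character, confidence) pairs of equal length.
    Where the passes disagree, the higher weighted confidence wins.
    """
    merged = []
    for (ch_a, conf_a), (ch_b, conf_b) in zip(original, inverted):
        if ch_a == ch_b or conf_a * original_weight >= conf_b * inverted_weight:
            merged.append(ch_a)
        else:
            merged.append(ch_b)
    return "".join(merged)


# Hypothetical results from the normal pass and the inverted pass.
normal_pass = [("I", 0.9), ("n", 0.9), ("v", 0.9), ("0", 0.4), ("i", 0.9), ("c", 0.6), ("e", 0.9)]
inverted_pass = [("I", 0.8), ("n", 0.8), ("v", 0.8), ("o", 0.9), ("i", 0.8), ("e", 0.7), ("e", 0.8)]

print(vote(normal_pass, inverted_pass))  # Invoice
```

In this made-up example the inverted pass corrects the low-confidence "0", while the extra weight on the original pass keeps its "c" even though the inverted pass reported a slightly higher raw confidence for a different letter.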

Chris Riley – Industry Expert


I often speak of unique uses of OCR, and here is yet another. OCRing video files! But why? Part of the management of rich media assets is indexing these files. Technologies such as speech recognition and optical character recognition give a greater index and search value to rich media.

By using OCR technology to find and extract text from video frames, the data can be stored as meta-data. In the simplest scenario, this is a text file that accompanies the video file. More complex environments will even tell you the minute and second at which the text occurs. Because this is not a traditional use of the technology, some special considerations apply.

First is converting and separating frames into individual image files. For the OCR to be effective it needs to work on a series of images. Although a video is only a sequence of images shown at a high rate of speed, it’s still somewhat of a challenge to convert video files such as MPEG to a series of images. On top of that, motion blur in some frames will also be a problem.

The second challenge is dealing with frames that are repeats. Essentially, because there are so many similar images that are only slightly different from each other, the text on a series of frames might not change. Better OCR results will account for this and not repeat text as the frames would.

And finally, there is dealing with the variations of fonts and the often small sizes. This requires an OCR engine with specific settings for specialized OCR, and one that is very accurate on complex low quality documents.
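The repeated-frame challenge described above can be sketched in Python. The example assumes the frames have already been extracted and OCR'd into one text string per frame, and that the frame rate is known. The function name and sample data are hypothetical, and matching is exact after whitespace clean-up, whereas real OCR output would usually need fuzzy comparison.

```python
def collapse_repeats(frame_texts, frames_per_second):
    """Build a (minute, second, text) index from per-frame OCR text.

    Consecutive frames showing the same text produce a single entry,
    stamped with the time the text first appears. Empty frames are skipped.
    """
    index = []
    previous = None
    for frame_number, text in enumerate(frame_texts):
        normalized = " ".join(text.split())  # collapse stray whitespace
        if normalized and normalized != previous:
            minute, second = divmod(int(frame_number / frames_per_second), 60)
            index.append((minute, second, normalized))
        previous = normalized
    return index


# Hypothetical OCR output for six frames sampled at 2 frames per second.
frames = ["BREAKING NEWS"] * 3 + [""] + ["Weather  at 6"] * 2
for entry in collapse_repeats(frames, frames_per_second=2):
    print(entry)
```

Six frames collapse to two index entries, each carrying the minute and second of first appearance, which is the kind of meta-data a search index can use.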

I expect that in the future, this technique in conjunction with speech recognition will be used in eDiscovery, content management, and robust search of rich media files.

Chris Riley – Industry Expert


All technology markets are guilty of coming up with at least one or two confusing terms. In the document imaging world, it’s terms with very similar sounding names. They are technically similar, but strictly different.

One of the most confusing things in the imaging world is the difference between Image Capture software, often just called Capture, and Data Capture software. Not only are the names confusing, but technically there is a lot of overlap. All data capture products have imaging capabilities, and all capture products have basic data capture. The risk of the confusion is substituting one product for the other. For example, organizations that attempt to use the data capture functionality built into a capture application for a full blown project end up with little success and a lot of frustration. Let me explain where they fit.

Capture products have the primary function of delivering quality images in a proper document structure. They often feature image clean-up, review, and page splitting tools that are more advanced than the scanning found in data capture applications. Most demonstrate what is called rubber-band OCR, the reading of a specific coordinate on a page. Some go as far as creating templates where coordinate zones are saved. This is where the solutions get confused with data capture. Until there is a registration of documents and proper forms processing approaches, it is not data capture. The risk of such basic templates is low accuracy and zones that do not always collect data.
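To show why fixed coordinate zones alone are fragile, here is a small Python sketch of rubber-band style reading. OCR'd words are modeled as (text, x, y) tuples giving each word's center point in pixels, and a zone is a rectangle. All names, coordinates, and the 40 pixel shift are hypothetical.

```python
def read_zone(words, zone):
    """Return the text of all words whose center lies inside the zone.

    words: list of (text, x, y) tuples; zone: (left, top, right, bottom).
    """
    left, top, right, bottom = zone
    return " ".join(
        text for text, x, y in words if left <= x <= right and top <= y <= bottom
    )


# Saved template zone for the invoice number.
invoice_number_zone = (280, 10, 400, 30)

# A page scanned exactly like the one the template was drawn on.
page = [("Invoice", 50, 20), ("#", 300, 20), ("10045", 340, 20), ("Total", 300, 500)]
print(read_zone(page, invoice_number_zone))  # "# 10045"

# The same page fed 40 pixels lower: the zone now captures nothing.
shifted_page = [(text, x, y + 40) for text, x, y in page]
print(repr(read_zone(shifted_page, invoice_number_zone)))  # ''
```

Without registering the page to the template first, a small scanning offset is enough for the zone to miss its data.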

Data capture products need images to function, so it was an obvious choice to add scanning to the solutions. These solutions, however, are better fed by a full capture application that has the performance and additional features, such as batch naming, annotations, and page splitting, that the organization may require in the resulting image files. For data capture, the purpose of image capture is getting data only, so these products sometimes neglect the features that are important for image storage and archival.

In the end, both solutions are improving in the other’s territory. Eventually the lines will blur to the point where feature-wise they will be identical, and the benefit of one over the other will be rooted in the vendor’s expertise, either capture or data capture. If your primary requirement is quality images, a capture vendor’s solution is the best choice, but if it’s data extraction, then solutions rooted in data capture are better.

Chris Riley – Industry Expert
