In this project, requested by one of our users, we provide analysis and suggestions on how to process photos of marathon runners and extract text data from these pictures with OCR. This article describes the fully automated OCR Cloud 2.0 API approach and automated tools for developers, to be used without human intervention when processing these images. If you are interested in a semi-automated process that includes human verification options, please contact us separately.
In this project, there are several parts we will discuss separately, but overall we believe it is possible to achieve good recognition results on most good-quality images.
This can be considered a medium-to-hard complexity project, due to multiple factors: technology limitations and several decision steps in the approach.
We will test several images from the same category to illustrate how OCR works internally, what limitations exist in these specific images, and what we can do to optimize output quality.
First, we will test one random image and describe every step that happens to that specific image in the background processes. The same processes occur for each image processed.
NOTE: The original photographs are high resolution and large, around 3 MB each. The images above and below were reduced in size only for visual explanation and illustration purposes.
For simplicity of explanation, and to further explain how OCR engines operate internally, let’s review the binarized image next.
Binarization – the process of converting every pixel in the photo to either black or white, which effectively converts the photo into a pure black & white image.
OCR will use this binarized image for further processing internally. OCR also uses some grey and color information to further increase processing quality, but the effects of that additional image information and color depth are minimal, and will be skipped in this discussion.
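To make the concept concrete, here is a minimal sketch of global-threshold binarization. Real OCR engines use adaptive (local) thresholding; the fixed cutoff of 128 below is a simplifying assumption for illustration only.

```python
THRESHOLD = 128  # assumed global cutoff; real engines pick this adaptively

def binarize(gray_pixels):
    """Map each 0-255 grayscale value to pure black (0) or white (255)."""
    return [[0 if p < THRESHOLD else 255 for p in row] for row in gray_pixels]

# Example: dark pixels become black, light pixels become white.
binarize([[30, 200], [127, 128]])  # [[0, 255], [0, 255]]
```

Every shade of grey is forced to one of two extremes, which is exactly why low-contrast details can vanish, as the observations below show.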
Before we proceed to text recognition, let’s notice a few binarization effects.
Observation: In the color image, the "+3" was clearly visible in orange. In the binarized image, the orange on a light blue background was converted to white, which caused it to disappear: there was not enough contrast between the graphic and its background. Since it disappeared from the binarized image, this text/graphic will not be processed by OCR, because it is 'invisible' in the black & white plane OCR uses.
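A short worked example shows why this happens. The RGB values below are illustrative guesses for "orange" and "light blue", not values sampled from the actual photo, and the grayscale conversion uses the standard ITU-R BT.601 luminance weights.

```python
def luminance(r, g, b):
    """Approximate perceived brightness of an RGB pixel (0-255), BT.601 weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b

orange     = luminance(255, 165, 0)    # about 173
light_blue = luminance(173, 216, 230)  # about 205

# With a typical threshold near 128, both values land on the "white" side,
# so the orange text and its light blue background become indistinguishable.
print(orange > 128 and light_blue > 128)  # True
```

Two colors that contrast strongly in hue can still have nearly identical brightness, and brightness is all that survives binarization.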
Observation: In the color image, the text "accelmed" and other text around the logo is somewhat readable, but in the binarized image it is unreadable. For this reason, it will be unreadable to OCR. There is not enough contrast (light text on a light background) or resolution (the size is too small, too few pixels). The image is also distorted, further complicating recognition.
As a rule of thumb, if the text is straight and clearly visible and readable by the human eye in the black & white image, we can expect a good OCR result.
Observation: Digital images frequently have shadows, creases on objects, overlapping objects, and many other imperfections that may obstruct clean character recognition. In this example, a crease in the paper produced noise visible in the binarized image on top of the character 2. Unfortunately, this noise is a) connected to the character, which alters the character's structure, and b) looks like an apostrophe or a stress mark, which is a valid symbol in many languages. Distortions like this sometimes affect OCR quality.
NOTE: Some developers choose to pre-process images before submitting them for OCR processing. This is a good option to consider, especially since developers can fine-tune binarization to their image sources and characteristics. Transmitting black & white images to the API is also much faster due to the decreased file size.
Once the image enters the OCR process, the steps after Binarization are Analysis and Recognition.
Analysis – the detection and separation of objects present in the picture into discrete components. There are four types of components: text, picture, table, barcode.
Recognition – the process of converting image data into text characters.
Here is the original binarized image after Analysis:
Observation: After analysis, there were 5 distinct areas located on this binarized photo. Green blocks contain what looks like text. Red blocks contain what looks like pictures. There were no Tables or Barcodes detected on this photo.
As we can see in the green text blocks here, other than the "ZONC" logo, which will produce valid characters, the rest of the recognition will produce error characters, because that is not real text. We also see the "205" text inside a red picture block, but it will not be converted to text because it is within a picture (like a logo).
OCR-IT OCR Cloud 2.0 API contains several Analysis modes. The above mode, 'MixedDocument', is useful for standard documents, such as typical office documents, brochures, newspapers, etc. It can handle a combination of text and images on the page. But in some cases it is more desirable to extract ANY text, even if it is within pictures. Then the 'TextAgressive' analysis mode can be used. Please see the OCR-IT OCR Cloud 2.0 API documentation for additional information, and search for the 'AnalysisMode' keyword: ocr-it.com/ocr-cloud-2-0-api/documentation
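As a rough sketch, switching the analysis mode amounts to changing one request parameter. The 'AnalysisMode' keyword comes from the API documentation referenced above; the overall parameter layout shown here is an assumption, so consult the documentation for the actual request format, endpoint, and upload fields.

```python
from urllib.parse import urlencode

def build_ocr_params(analysis_mode="TextAgressive", language="English"):
    """Assemble query parameters for an OCR job (hypothetical layout)."""
    return urlencode({
        "AnalysisMode": analysis_mode,  # 'MixedDocument' is the standard-document mode
        "Language": language,
    })

build_ocr_params()  # 'AnalysisMode=TextAgressive&Language=English'
```

For this project, `build_ocr_params("TextAgressive", "DigitsOnly")` would request aggressive text extraction restricted to digits, matching the settings discussed below.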
In ‘TextAgressive’ mode, we can extract more text:
Observation: With this analysis mode geared towards maximum text extraction, we can extract more text than before. This mode overlooks picture elements in favor of text.
After Analysis, the actual text recognition starts. All blocks that were determined to contain text will be processed with the specified OCR language.
In this case, if the developer wants to extract numbers only, they may use the "DigitsOnly" language instead of the generic "English". This will produce incorrect OCR results for any non-digit blocks, but it will produce high-accuracy OCR for the main numeric field, which is the only field we are interested in for this project.
Next, let’s generalize and review the entire project. Here are a few images:
We have the following facts and requirements:
– each image is unique
– position of text changes with every runner AND every photo
– need to capture numbers only
– there are many thousands of such pictures
Most clearly visible numbers can be processed successfully by making only two setting adjustments from default values to OCR-IT OCR Cloud 2.0 API.
1. AnalysisMode should be set to TextAgressive (or Indexing; try which one works better for the specific project).
2. Language should be set to DigitsOnly.
This will maximize the success rate of recognition of the target data. The OCR result will also contain data produced from other elements (logos, creases that look like characters, road signs, etc.), so it should be properly filtered.
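The filtering step can be sketched as a simple post-processing pass over the raw OCR output. The 1-to-5-digit length range below is an assumption about a typical race's bib numbering and should be adjusted to the actual bib format of the event.

```python
import re

def extract_bib_numbers(ocr_text, min_len=1, max_len=5):
    """Return digit runs that could plausibly be bib numbers, discarding noise."""
    candidates = re.findall(r"\d+", ocr_text)
    return [c for c in candidates if min_len <= len(c) <= max_len]

# Noise characters from logos and creases are dropped; only clean digit
# runs within the expected length range survive.
extract_bib_numbers("ZONC ,,' 205 accelmed")  # ['205']
```

Further filtering, such as matching results against the registered bib-number list, would raise confidence even more.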
Many numbers with excessive skew or obstructions will not produce good OCR quality with any kind of pre-processing; these require human assistance:
This concludes the detailed exploration of this project example. If you have any questions, or would like us to study and explain your specific samples, please contact us.
Ilya Evdokimov is a long-term practitioner and expert in leading Optical Character Recognition (OCR), Data Capture and Document Processing techniques, technologies and solutions. With over 15 years of experience spanning enterprise software implementations, mobile application development, cloud-based systems integration and desktop-level automation, Ilya Evdokimov uses thorough industry knowledge and experience to achieve high efficiency and workflow optimization in the most challenging paper-dependent and digital image capture environments.