OCRing Magazines

by Ilya Evdokimov | Mar 03, 2010 | OCR


Oftentimes when I receive printed periodicals, my preference is to OCR them to a digital, searchable format and read the articles I’m interested in on my computer, just like my online periodicals. One of these printed documents might be a magazine. Magazines are either very easy to OCR or very difficult, and usually both cases exist in a single magazine. It all has to do with the graphical elements that are often incorporated in magazines.

Text printed on graphics. Very often articles will have text printed over related graphics. If entire paragraphs are printed over a single graphic, it’s less challenging; but when text overlaps both a graphic and white space, it’s problematic because the text, sometimes within a single word, will change from a color to normal black in order to contrast with the image.

Annotated images. Many magazines, including my favorite scientific one, include text as part of the diagrams in their articles. To many, this text may be irrelevant, but to me these annotations have become important search words at the very least. They tend to be set in a small font and are often hard for the OCR engine to identify because of their close proximity to images.

The good news is that, for the most part, the purpose of OCRing any magazine is to make its text searchable. Anything more would probably be illegal. The other good news is that there are tricks to deal with each of these problems. First, a magazine that is being OCRed must be scanned in color. The additional information provided by the color scan helps the OCR engine distinguish graphics from text printed on graphics. Second, enable the most thorough recognition mode your engine offers, along with any settings geared to small fonts. Third, turn off document analysis or enable only limited document analysis. This is the least obvious setting. By disabling document analysis, you keep the OCR engine from getting confused by strange structure, text printed on graphics, and annotated images. You are forcing it to read all possible text.
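As a rough illustration of the third setting, here is a minimal sketch using the open-source Tesseract command line. This is an assumption on my part: the article does not name an engine, and other engines expose this choice differently. Tesseract's `--psm 11` ("sparse text") mode is the closest analogue I know of to limited document analysis, since it looks for as much text as possible in no particular order. The function name `build_tesseract_command` and the file names are hypothetical, and the sketch does not try to map the color-scan or small-font settings.

```python
from typing import List

# Page segmentation modes as documented for Tesseract's --psm flag.
PSM_FULL_LAYOUT_ANALYSIS = 3   # default: fully automatic page segmentation
PSM_SPARSE_TEXT = 11           # find as much text as possible, in no particular order


def build_tesseract_command(image_path: str,
                            output_base: str,
                            analyze_layout: bool = False,
                            lang: str = "eng") -> List[str]:
    """Build (but do not run) a Tesseract command line for one scanned page.

    With analyze_layout=False the command asks for sparse-text mode, which
    favors extracting the most text over reconstructing page structure.
    """
    psm = PSM_FULL_LAYOUT_ANALYSIS if analyze_layout else PSM_SPARSE_TEXT
    return ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]


if __name__ == "__main__":
    # Hypothetical file name for one color scan of a magazine page.
    command = build_tesseract_command("magazine_page_012.png", "magazine_page_012")
    print(" ".join(command))
```

The sketch only assembles the command. Running it, for example with `subprocess.run(command, check=True)`, requires Tesseract to be installed, and would write the recognized text to `magazine_page_012.txt`.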

Because searchable text is the greatest benefit of OCRing my periodicals, I have opted for the OCR settings that produce the most text and the least structure. If you are converting similar documents, I recommend doing the same.

Chris Riley – Sr. Solutions Architect