Why output PDF may decrease in quality and increase in size after OCR

by Ilya Evdokimov | Apr 26, 2012 | Accuracy

Decrease in quality and increase in size

Suppose your original PDF is a multi-page digital document with text and graphics, only a few KB in size.  The text looks sharp no matter how far you zoom in.  After running it through OCR and comparing the result to the original, you find that a) the overall file size has increased substantially and b) the pages look lower in quality on screen.

This result is expected if you are saving to PDF Text Under Image.  In this mode, as you specified, the software rasterizes (creates an image of) every page so that the page image is visible and the text is stored under it.  This is the reason for the decrease in quality: the visible page image is more pixelated than the original digital text, and the OCR text is hidden under the image.  The file size can also increase substantially, because the result PDF now contains a newly created picture of every page plus the OCR text, whereas before it held only digital content.  Storing a newly created image for each page takes more space.
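To get a feel for why a page picture costs so much more than digital text, here is a rough sketch of the pixel data behind one rasterized page.  The page size (US Letter), the resolutions, and 24-bit color are my assumptions for illustration, not settings taken from the software, and the function name is made up.  It gives an upper bound before any compression, so real output files will normally be much smaller.

```python
def raw_raster_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one rasterized page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed values for illustration: US Letter page, 24-bit color.
page = (8.5, 11.0)  # width and height in inches
for dpi in (150, 300):
    size = raw_raster_bytes(*page, dpi, bits_per_pixel=24)
    print(f"{dpi} DPI: {size / 1_000_000:.1f} MB of pixel data before compression")
```

Under these assumptions a single page at 300 DPI holds about 25 MB of raw pixel data.  Halving the resolution cuts that to a quarter, which is why down-sampling and compression settings have such a large effect on the output size.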

I tested a digitally generated PDF file containing 10 pages and some color graphics, a scenario that should reproduce this common situation well.  No compression or down-sampling was specified in the export settings for this test; enabling either can help decrease the file size further.

  • Original digital 10-page PDF
    Contains digital text and some digital color graphics.
    30.5 KB
  • Processed PDF, Text Only
    Digital text is visible, along with some OCR mistakes.  Formatting around graphics has been altered slightly.
    15.4 KB
  • Processed PDF, Text Under Image
    Rasterized pixelated picture of each page is visible.  Perfect preservation of original look and formatting.  Text is stored under page pictures for selection and searching.
    763 KB
  • Processed PDF, Text Over Image
    Rasterized pixelated page picture is visible in some graphics, with good preservation of original formatting, but the text is sharp because it is placed on top of the page picture.  OCR inaccuracies are also visible.
    67.8 KB
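To compare the four results at a glance, the sizes listed above can be expressed relative to the original file.  This is only a small sketch using the numbers from my test; the dictionary keys and the function name are made up for illustration.

```python
# File sizes in KB, taken from the test results above.
sizes_kb = {
    "Original digital PDF": 30.5,
    "Text Only": 15.4,
    "Text Under Image": 763.0,
    "Text Over Image": 67.8,
}


def size_ratios(sizes, baseline):
    """Return each size divided by the size of the baseline entry."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


ratios = size_ratios(sizes_kb, "Original digital PDF")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the original size")
```

In this test, Text Under Image comes out at roughly 25 times the original size, Text Over Image at a little over 2 times, and Text Only at about half.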

Note that this test applies only to ‘digitally-created’ PDFs, where text already exists in vector form.  For such files, saving to Text Under Image creates a whole new picture layer, which increases the storage size.  An ‘image-based’ PDF such as a scan already contains an image of each page before processing.  In that case OCR would add only a text layer, which is small, and the size difference would not be noticeable.

The choice among the three available PDF export types affects the look and size of the output PDF.  PDF Text Under Image is the most commonly used format for archiving, indexing, and preservation, but it can come at the cost of a size increase if the original PDF did not contain an image layer.  The quality and compression settings available in the software can help decrease the output size.
