As the popularity of reading eBooks increases, so does the demand to convert printed books into eBooks. Legality aside, the promise of using OCR technology to create eBooks is very high, and the process is not too difficult. There are a few things to remember when using OCR to create an eBook. Getting a digital file into an eBook format is relatively easy; creating the content for that format is the challenge. Enter Optical Character Recognition (OCR). There are three steps to successfully creating an eBook with OCR:
1.) How you scan
2.) How you optimize the image
3.) How you OCR the image
There are two common ways to scan a book. If you are lucky enough to have a book scanner, this is the preferred approach because it does not require destroying the book. These scanners are very pricey, but they do a great job. A book scanner produces one image for every two pages; we will get to this in a moment. The other way is to remove the binding and feed the loose pages through a typical document scanner, producing one image file per page. The quality of this approach is high, sometimes even higher than with a book scanner, but it is less convenient. It’s important to keep the pages in the correct order, because you often have to scan in batches and it’s easy to get pages mixed up. Scanning should be done at 300 DPI in grayscale and saved as TIFF. Note that Group 4 compression applies only to black-and-white (bitonal) images, so use it if you scan in black and white, and a lossless option such as LZW if you scan in grayscale. This will produce the ideal image. Unless the book uses very small fonts, these settings will do the trick. Scanning in color is only required if your book has color photographs.
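One way to keep page order intact across batches is to plan sequential page names before renaming anything. The sketch below is only an illustration: the function name number_pages, the page_0001.tif naming pattern, and the assumption that each batch is already listed in the order it was scanned are all hypothetical choices, not part of any particular scanner's software.

```python
def number_pages(batches, first_page=1):
    """Plan sequential page names for files scanned in several batches.

    `batches` is a list of batches, in scanning order; each batch is a
    list of file names, assumed to be in the order they were scanned.
    Returns (batch_index, original_name, new_name) tuples. The batch index
    is kept because scanners often restart their file counter per batch,
    so the same original name can appear more than once.
    """
    plan = []
    page = first_page
    for batch_index, batch in enumerate(batches):
        for original in batch:
            plan.append((batch_index, original, f"page_{page:04d}.tif"))
            page += 1
    return plan


# Hypothetical example: two batches whose scanner restarted its counter.
batches = [["scan_001.tif", "scan_002.tif"], ["scan_001.tif"]]
for entry in number_pages(batches):
    print(entry)
```

Building the plan first, rather than renaming files on the fly, makes it possible to review the mapping before any file is touched.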
Once scanning is done and you have image files, it’s time to apply image processing. For the most part, books scanned with the binding removed will not need it, except perhaps line straightening and deskew to correct crooked scans. For books scanned with a book scanner, two imaging tools are always applied. The first is page separation, which splits the left and right sides of the image into the two separate pages of the book. The result is two separate image files. Next, line straightening is required on each of these image files. Because the binding of a book causes pages to curve inward, the text appears as curved lines in a book scan. Line straightening finds the baseline for each line on the page and makes every portion of the line follow it.
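Page separation can be sketched in a few lines. This is a simplified illustration, not how any specific imaging product works: it assumes the image is a plain list of rows of grayscale values (0 is black, 255 is white) and that the binding shows up as the darkest column near the center of the spread. The names find_gutter and split_spread are hypothetical.

```python
def find_gutter(image, search_fraction=0.2):
    """Return the column index of the gutter in a two-page spread.

    Assumption: the binding shadow is the darkest column within the
    central `search_fraction` of the image width.
    """
    width = len(image[0])
    center = width // 2
    margin = max(1, int(width * search_fraction / 2))
    # Clamp the search window so it never leaves the image.
    candidates = range(max(0, center - margin), min(width, center + margin + 1))
    # The smallest column sum is the darkest column.
    return min(candidates, key=lambda x: sum(row[x] for row in image))


def split_spread(image):
    """Split a two-page image into (left_page, right_page) at the gutter."""
    gutter = find_gutter(image)
    left = [row[:gutter] for row in image]
    right = [row[gutter + 1:] for row in image]  # drop the gutter column itself
    return left, right


# Tiny synthetic spread: white pages with a dark binding at column 4.
spread = [[255] * 10 for _ in range(4)]
for row in spread:
    row[4] = 0
left_page, right_page = split_spread(spread)
```

Real scans are noisier than this, so a production tool would typically smooth the column sums or look for a band rather than a single column. Line straightening would still have to be applied to each half afterward.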
Now the magic of OCR can take place. For about 90% of the books out there, following these steps will produce an accurate eBook. There are many utilities that will then take the OCR output (text, DOC, XML, etc.) and convert it into the desired eBook format. Some tagging, for chapters and the like, may be required to get the full functionality of eBook readers.
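The chapter tagging mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: chapter headings in the OCR text sit on their own line and start with the word "Chapter", paragraphs are separated by blank lines, and the output is simple XHTML-style markup of the kind many eBook conversion utilities accept. The function names are hypothetical, and a body line that happens to start with "Chapter" would be mistaken for a heading.

```python
import re
from html import escape

# Assumed heading style: a line such as "Chapter 1" or "CHAPTER TWO: Title".
CHAPTER_HEADING = re.compile(r"^(chapter[ \t]+\w+.*)$", re.IGNORECASE | re.MULTILINE)


def split_chapters(text):
    """Return a list of (title, body) pairs found in plain OCR text."""
    # With one capturing group, re.split yields:
    # [front matter, title1, body1, title2, body2, ...]
    parts = CHAPTER_HEADING.split(text)
    chapters = []
    front = parts[0].strip()
    if front:
        chapters.append(("Front Matter", front))
    for title, body in zip(parts[1::2], parts[2::2]):
        chapters.append((title.strip(), body.strip()))
    return chapters


def chapter_to_xhtml(title, body):
    """Wrap one chapter in heading and paragraph tags."""
    lines = [f"<h1>{escape(title)}</h1>"]
    for paragraph in re.split(r"\n\s*\n", body):
        # Rejoin lines that OCR broke in the middle of a paragraph.
        joined = " ".join(paragraph.split())
        if joined:
            lines.append(f"<p>{escape(joined)}</p>")
    return "\n".join(lines)


sample = "My Book\n\nChapter 1\nIt was a dark\nnight.\n\nSecond para.\n\nCHAPTER 2\nThe end & more.\n"
for title, body in split_chapters(sample):
    print(chapter_to_xhtml(title, body))
```

Escaping the text matters because OCR output can contain characters such as & or < that would otherwise break the markup.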
Chris Riley – Sr. Solutions Architect
Ilya Evdokimov is a long-term practitioner and expert in leading Optical Character Recognition (OCR), Data Capture and Document Processing techniques, technologies and solutions. With over 15 years of experience spanning enterprise software implementations, mobile application development, cloud-based systems integration and desktop-level automation, Ilya Evdokimov uses thorough industry knowledge and experience to achieve high efficiency and workflow optimization in the most challenging paper-dependent and digital image capture environments.