Tips for recognizing multiple languages and processing documents with mixed languages

OCR recognizing multiple languages

The OCR-IT API can recognize text in over 180 languages, more than most other OCR systems, cloud-based or not.  This makes the API useful in every region of the world without the need to change the API structure, develop different code, or sign up for any other services.  Numerous users already use our API to process text from images generated around the globe, and we continue to add supported languages.  The language setting is one of the primary parameters for successful OCR conversion, and it is FREE to use unless you turn on one of the specialty languages, which cost extra (see list below).  Selecting an incorrect language for a document will most likely degrade the speed and quality of OCR, and frequently all of the text may become unreadable, so it is an important parameter.

Auto-detect multiple languages?  Sure!

If you do not know in advance what language will appear in the next picture or document, select several languages at once, and the OCR-IT API will choose the best one to use.  For example, in Canada, a user may take a picture of something in French, immediately followed by another picture in English.  Or a company in Germany may receive a fax in English, followed by a fax in German, followed by a fax in French.  In such situations, selecting multiple languages in the OCR-IT API resolves this complex technical challenge automatically.  A few suggestions will help you optimize your multi-language environment:

  • Use the fewest possible languages for the best recognition result.  If you can know precisely which language to use with which document, for example by keeping separate folders per language, that is the best option.  It will produce the highest quality and the fastest processing.  If you have to use a combination of several languages, use as few as possible.  The OCR-IT API will process your document with each language (which costs time), and at the end will select the best result based on quality statistics.  OCR-IT developers suggest enabling no more than 2-5 languages unless absolutely necessary.
  • Some languages can be separated better than others when mixed.  Languages have different character sets and characteristics.  Some languages are similar; others are very different.  Mixing substantially distinct languages should produce better auto-detection and OCR quality than mixing similar languages.  For example, enabling English and Russian at the same time is safe, because these two languages are very distinct and use very different character sets (Latin and Cyrillic), which makes auto-detection easier and more accurate.  On the other hand, English and Spanish use the same character set (Latin) and share many similar words, which makes automatic separation of these languages technically more complex.
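
The first suggestion above can be illustrated with a short sketch.  It is not the OCR-IT implementation and does not call the OCR-IT API: the recognize function, the OcrResult type, and the confidence score are all hypothetical stand-ins, assumed here only to show the idea of running one pass per enabled language and keeping the best-scoring result.

```python
import warnings
from typing import Callable, Iterable, NamedTuple

# Upper end of the 2-5 language range suggested in the text above.
MAX_RECOMMENDED_LANGUAGES = 5


class OcrResult(NamedTuple):
    """Hypothetical result type; not part of the OCR-IT API."""
    language: str
    text: str
    confidence: float  # assumed quality score, higher is better


def best_result(
    image: bytes,
    languages: Iterable[str],
    recognize: Callable[[bytes, str], OcrResult],
) -> OcrResult:
    """Run one OCR pass per language and keep the highest-scoring result.

    `recognize` is a caller-supplied (hypothetical) function that performs
    a single-language OCR pass and returns an OcrResult.
    """
    candidates = list(dict.fromkeys(languages))  # drop duplicates, keep order
    if not candidates:
        raise ValueError("at least one language is required")
    if len(candidates) > MAX_RECOMMENDED_LANGUAGES:
        warnings.warn(
            f"{len(candidates)} languages enabled; each one adds a full "
            "pass, so expect slower processing"
        )
    # One pass per language: total time grows with the number of languages.
    results = [recognize(image, lang) for lang in candidates]
    return max(results, key=lambda r: r.confidence)
```

Because every enabled language adds a full pass in this sketch, the cost grows linearly with the length of the language list, which is why keeping the list short helps.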

Auto-detect multiple languages inter-mixed in the same document?  Sure!

Even if multiple languages are mixed in the SAME document, the OCR-IT API will automatically select the appropriate language to use for each word.  For example, a legal agreement in Canada may contain both English and French on the same page.  While most OCR services fail on one language or the other, OCR-IT will process both with high quality.  The only thing you need to do is enable English and French through the API.
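
To see why per-word separation is easier for languages with different character sets, here is a minimal sketch that labels each word by its Unicode script.  It is only an illustration under stated assumptions: the function names are hypothetical, it works on already-recognized text rather than images, and it is not how OCR-IT selects languages.  It also cannot tell English from French, since both use Latin letters; that case needs dictionaries and language statistics.

```python
import unicodedata
from typing import List, Tuple


def char_script(ch: str) -> str:
    """Return a coarse script label taken from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return "Other"


def word_script(word: str) -> str:
    """Label a word by the script of its letters.

    Returns "Mixed" if the letters come from more than one script and
    "Other" if the word has no letters (for example, a number).
    """
    scripts = {char_script(ch) for ch in word if ch.isalpha()}
    if len(scripts) == 1:
        return scripts.pop()
    return "Mixed" if scripts else "Other"


def tag_words(text: str) -> List[Tuple[str, str]]:
    """Split text on whitespace and pair each word with its script label."""
    return [(word, word_script(word)) for word in text.split()]
```

With this sketch, a Russian word and an English word receive different labels, while an English word and a French word both come back as "Latin", which mirrors the point that similar languages are harder to separate.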

The list of all supported languages is here (scroll to bottom).

Contact OCR-IT Support if you need any additional information, want to request new languages that are not in the current list, or have any other feedback or suggestions.