OCR with tesseract, python and pytesseract

2 minutes reading

Python is super versatile, it has a giant community that has libraries that allow to achieve great things with few lines of code, Optical Character Recognition (OCR) is one of them, for that you just need to install tesseract and the python bindings, called pytesseract.

Some OCR applications

OCR is quite useful for social networks, where you can scan the text that appears in the images to read its content and then process it or give it statistical treatment.

Here's another case, imagine a program that scans image boards or social networks, extracts a couple of images from the posted videos and links them to a tik tok account using the watermark that appears on each video.

Or maybe a page that uploads images of your products with your prices written on each of them. With OCR it is possible to get all their prices, and upload them to your database, downloading and processing their images.

Facebook must use some kind of similar technology to censor images that include offensive text, according to its policies, that are uploaded to its social network.

Another of the most common applications is the transformation of a pdf book into images to text, ideal for transforming old book scans to epub or text files.

Installation of tesseract-ocr

To perform OCR with Python we will need tesseract, which is the library that handles all the heavy lifting and image processing.

Make sure you install the newest tesseract-ocr, there is a huge difference between version 3 and versions after 4, as neural networks were implemented to improve character recognition. I am using version 5 alpha.

sudo apt install tesseract-ocr
tesseract -v
tesseract 5.0.0-alpha-20201224-3-ge1a3

Differences in OCR engine efficiency between tesseract 3 and tesseract 5 alpha. Version 5 shows better performance](images/OCRTesseractVersion5vsVersion3-2.png "Comparison between OCR performance of tesseract 3 and tesseract 5")

Installing languages in tesseract

We can see which languages are installed with --list-langs.

tesseract --list-langs

It is obvious, but it is necessary to mention that the extent to which it recognizes the text will depend on whether we use it in the correct language. Let's install the Spanish language.

sudo apt install tesseract-ocr-spa
tesseract --list-langs
List of available languages (3):

You will see that Spanish is now installed and we can use it to detect the text in our images by adding the -l spa option to the end of our command

OCR with tesseract

Now let's put it to the test to recognize text in images, straight from the terminal. I am going to use the following image:

Image with text to be processed](images/imagen_with_text.jpg "File: image_with_text.jpg")

tesseract imagen_con_texto.jpg -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 139
Do you have the time to listen to me whine

The "-" at the end of the command tells tesseract to send the results of the analysis to the standard output, so that we can view them in the terminal.

It is possible to tell tesseract which OCR engine to use:

  • 0: for the original tesseract
  • 1: for neural networks
  • 2: tesseract and neural networks
  • 3: Default, whichever is available
tesseract imagen_con_texto.jpg - --oem 1

Consider that not all language files work with the original tesseract (0 and 3). Although generally the neural networks one is the one that gives the best result. You can find the models compatible with the original tesseract and neural networks in the tesseract repository.

You can install them manually by downloading them and moving them to the appropriate folder, in my case it is /usr/local/share/tessdata/, but it may be different on your system.

wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
sudo mv eng.traineddata /usr/local/share/tessdata/

Installing pytesseract

After installation we add pytesseract (the python bindings) and pillow (for image management) to our virtual environment.

pipenv install pytesseract pillow

Read text from images with python

First let's check the languages we have installed.

import pytesseract
from PIL import Image
import pytesseract

# ['eng', 'osd', 'spa']

Now that we have the languages, we can read the text of our images.

The code is quite short and self-explanatory. Basically we pass the image as an argument to pytesseract's image_to_string() method.

import pytesseract

from PIL import Image
import pytesseract

img = Image.open("nuestra_imagen.jpg") # Abre la imagen con pillow
text = pytesseract.image_to_string(img, lang='eng') # Extrae el texto de la imagen

# Do you have the time to listen to me whine...

image_to_string() can receive as argument the language in which we want it to detect the text.

Tesseract when with a method with which we can obtain much more information from the image, image_to_data(), available for versions higher than 3.05.

data = pytesseract.image_to_data(img)

If you want to learn more visit the complete tesseract documentation.

Related content