[C++] Extracting Text From Image With OpenCV And Tesseract


Hi guys, it’s been a while. Lately, I’ve been working on some OCR projects where I got to write C++ most of the time. C++ is a language that can blow your whole leg off: the high performance is intriguing, but the whole process of setting up the environment, configuring and compiling can be unnecessarily daunting, especially for those who have no experience with compiled languages like C or C++.

That being said, in this post, we will first prepare the development environment. [Spoiler alert, I’m a guy who loves to do it hard so I will install OpenCV and Tesseract from source]. After that, I’m gonna show you how to link libraries and get your code compiled with CMake. For demonstration purposes, we will create a simple C++ project to read some images containing text with OpenCV and use Tesseract to extract the text to the console output. Let’s get started!

Environment Preparation

For the sake of reproducibility, I highly recommend that you use Docker. If you don’t, you may come across some unexpected errors.

Let’s go ahead and create a file named Dockerfile. The first line will be the base image, which is ubuntu:18.04.

FROM ubuntu:18.04

Next, we need to specify the WORKDIR. Basically, whenever you execute some command through RUN, WORKDIR is your current directory. This is extremely important to remember. You will see why soon.

WORKDIR /home/usr

Now we will install the necessary libraries for OpenCV. A side note though, you have to add -y to every apt-get install. When building a docker image, technically you lose the ability to react to any prompts from the console.

RUN apt-get update && apt-get install -y build-essential && \
    apt-get install -y cmake vim git libgtk2.0-dev pkg-config \
    libavcodec-dev libavformat-dev libswscale-dev && \
    apt-get install -y python-dev python-numpy libtbb2 \
    libtbb-dev libjpeg-dev libpng-dev libtiff-dev libdc1394-22-dev \
    && apt-get clean

Next, we will clone opencv and opencv_contrib repositories. The opencv repository has grown quite large in size recently, so it’s gonna take a while.

RUN git clone https://github.com/opencv/opencv.git
RUN git clone https://github.com/opencv/opencv_contrib.git

Now, we can build and install OpenCV. As you can see, the commands below all run inside a single RUN instruction because we create and navigate through directories using relative paths. I do not recommend splitting them up, since each RUN starts again from WORKDIR and you would have to be very careful with the paths.

RUN cd opencv && mkdir -p build && cd build && \
    cmake -D CMAKE_BUILD_TYPE=Release \
          -D CMAKE_INSTALL_PREFIX=/usr/local \
          -D OPENCV_GENERATE_PKGCONFIG=ON \
          -D OPENCV_EXTRA_MODULES_PATH=/home/usr/opencv_contrib/modules .. \
    && make -j4 && make install

You can have a cup of coffee now. The building process may take about 30 minutes to 1 hour depending on your computer’s specs.

Okay, after a long while, OpenCV has been successfully installed. Let’s move on to installing Tesseract. The process is pretty much the same. First, we have to install some dependencies.

# curl is needed later to download the trained data
RUN apt-get update && apt-get install -y automake ca-certificates curl g++-8 \
    git libtool libleptonica-dev make pkg-config && \
    apt-get install -y --no-install-recommends asciidoc \
    docbook-xsl xsltproc && \
    apt-get install -y libicu-dev libpango1.0-dev libcairo2-dev

Next, let’s clone the tesseract repository.

RUN git clone https://github.com/tesseract-ocr/tesseract.git

You’re now familiar with this, aren’t you? We will go ahead and build and install Tesseract on our machine.

RUN cd tesseract && ./autogen.sh && ./configure \
    && make && make install && make training && \
    make training-install && ldconfig

Again, the build may take a while. Go grab yourself another drink. Don’t worry though, it won’t take as long. I think the long compilation time may be one of the reasons that people nowadays don’t fancy C++ that much. Another reason? Code that won’t even compile!

One final step we have to do is to download the pre-trained weights for Tesseract’s LSTM model, which we are about to use to detect text inside images.

RUN mkdir -p /usr/local/share/tessdata

RUN curl -o /usr/local/share/tessdata/eng.traineddata \
    https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/master/eng.traineddata

And that’s it. The preparation is now complete. You can save the Dockerfile and run the following command to build the image.

docker build -t tesseract-opencv .

Text Extraction Project

So we now have everything set up, and the next big question is: how are we gonna make use of the installed libraries? Well, it requires more than just a single pip install. But I will try my best to guide you through.

Here’s a brief introduction to the project that we’re gonna create. I have some captures that I took from the Faster R-CNN paper and Tesseract’s Wiki page. What we’re gonna do is pretty simple: use OpenCV to read each image and then use Tesseract to extract the text it contains. Let’s keep it straightforward so that we can focus on how to link libraries and compile our code.

Figure 1: capture from Faster R-CNN
Figure 2: capture from Tesseract page

Let’s create a new directory for the demonstration project called text-extraction.

mkdir text-extraction && cd text-extraction

Next, let’s run the docker image that we created earlier. We will mount the text-extraction folder to /home/usr/app in the Docker container.

docker run -it -v "$(pwd)":/home/usr/app tesseract-opencv bash

Before we write the actual code and get lost in the compilation error matrix, it’s a good idea to write a simple C++ program just to check if we can include OpenCV and Tesseract in our project. Let’s create a new file called main.cpp.

// main.cpp

#include <fstream>
#include <iostream>
#include <string>
#include <filesystem>
#include <chrono>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <opencv2/opencv.hpp>

int main() {
  std::cout << "Hello, World" << std::endl;
}

As you can see, we need baseapi.h from tesseract, allheaders.h from leptonica and opencv.hpp from opencv.

There are a bunch of ways to compile C++ code. I personally prefer CMake. Let’s create a new file called CMakeLists.txt.

Open up the file. First, we need to specify the minimum version of CMake. We will rely on the CXX_STANDARD 17 property later, which CMake supports from version 3.8.

cmake_minimum_required(VERSION 3.8 FATAL_ERROR)

Next, we need to set the compiler for CMake to use. CMake itself is not a compiler; it picks one of the compilers available on the system. Since we’re gonna use some C++17 features like std::filesystem::directory_iterator to loop through a directory, let’s tell CMake to use g++-8. Note that this line has to come before project(), because that is where CMake detects the compiler.

set(CMAKE_CXX_COMPILER /usr/bin/g++-8)
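
Why g++-8 in particular? As far as I know, GCC 8 is the first release that ships the standard <filesystem> header, while GCC 7 only offers <experimental/filesystem>. Below is a minimal sketch of how code can pick whichever header is available when compiled as C++17. The fs alias and the fileNameOf helper are hypothetical names I made up for illustration; they are not used in the project.

```cpp
#include <string>

// Pick the filesystem header that this compiler actually provides.
#if __has_include(<filesystem>)
#include <filesystem>
namespace fs = std::filesystem;
#elif __has_include(<experimental/filesystem>)
#include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
#else
#error "No filesystem library found; a C++17 compiler such as g++-8 is needed"
#endif

// Hypothetical helper: returns the file name component of a path,
// for example the image name to print in a log line.
std::string fileNameOf(const std::string &path) {
  return fs::path(path).filename().string();
}
```

In this post we simply require g++-8 and include <filesystem> directly, so the fallback branch is only there to show the idea.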

Then, we will name our project TEXT_EXTRACTION.

project(TEXT_EXTRACTION)

Coming next, we will tell CMake to find the libraries/packages we need for our project. First, OpenCV is easy, since we used CMake to install it before.

find_package(OpenCV REQUIRED)

Tesseract, on the other hand, is a little bit trickier. We did not install it via CMake, but luckily, we can rely on pkg-config to find the directory to which it was installed.

find_package(PkgConfig REQUIRED)
pkg_search_module(TESSERACT REQUIRED tesseract)
pkg_search_module(LEPTONICA REQUIRED lept)

Now that all the packages are found, we can tell CMake to add their include directories to our project.

include_directories(${OpenCV_INCLUDE_DIRS})
include_directories(${TESSERACT_INCLUDE_DIRS})
include_directories(${LEPTONICA_INCLUDE_DIRS})

Next, we need to tell CMake what to compile and name the target executable. In this project, we only have one source file so it’s pretty straightforward.

add_executable(main main.cpp)

Not done yet. We must then link the libraries to the target (i.e. OpenCV, Tesseract, and Leptonica). That’s exhausting, I know. Above we only told CMake to add the include directories, which are only .h files. We can’t live without their implementation, that’s why we have to link the libraries too (by the way, they are not .cpp files but usually .so files, which were compiled when we installed the libraries).

target_link_libraries(main ${OpenCV_LIBS})
target_link_libraries(main ${TESSERACT_LIBRARIES})
target_link_libraries(main ${LEPTONICA_LIBRARIES})

One last step though, we need to tell CMake to compile with C++17 and stdc++fs (standard c++ filesystem) library. Remember I said above that we’ll need it to iterate through the image directory?

target_link_libraries(main stdc++fs) # need this if using g++ 8
set_property(TARGET main PROPERTY CXX_STANDARD 17)

Okay guys, now it should be fine. Let’s go ahead and compile the code.

During the build, CMake generates a bunch of files and folders that you won’t want mixed up with your source code. It has almost become standard practice to create a separate folder for the build output.

mkdir build && cd build

Okay, let’s configure the project.

cmake ..

“Hmm. I’m kind of confused. What is configure? Is it different from compile?”

Basically speaking, there are three separate steps involved when building from source:

  • Configure: CMake gathers all the information needed for the compilation, such as where to find the dependencies, which compiler to use and where to install the target, then generates the build files (a Makefile in our case).
  • Compile: make invokes the compiler to turn the source files into individual object files (.o files), then links them together to generate the final executable.
  • Install (optional): the final executable is copied to a system path such as /usr/local/bin so that it can be invoked from anywhere. This is usually the case when you install libraries like OpenCV or Tesseract.

So, I hope you can distinguish the three steps now. By the way, if your console output after cmake .. looks something like the one below, then you’re good to go.

-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/g++-8
-- Check for working CXX compiler: /usr/bin/g++-8 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenCV: /usr/local (found version "4.3.0")
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1")
-- Checking for one of the modules 'tesseract'
-- Checking for one of the modules 'lept'
-- Configuring done
-- Generating done
-- Build files have been written to: /home/usr/app/build

Next, it’s time to compile our code. OMG, finally!

make

The main.cpp will then be compiled. If nothing goes wrong, you will see something like below:

Scanning dependencies of target main
[ 50%] Building CXX object CMakeFiles/main.dir/main.cpp.o
[100%] Linking CXX executable main
[100%] Built target main

Now, if you take a look at the content of the build directory, among the files and folders generated by CMake, there is a file called main. That is the final executable we’ve been longing for!

Let’s execute it right away.

./main

Hello, World

Ladies and gentlemen, I present to you: Hello, World in C++ made by CMake! That might sound unfair since we had to include other libraries 😛.

Good job, guys. At this point, I promise you, the hardest part is already over. Alright, let’s get to our main task.

First, we need to get rid of the std::cout line in the main function.

Next, let’s define a string to store the path to the image directory and a cv::Mat to hold the value of the input image.

// main.cpp
// inside main function

  std::string imagePath = "../images";
  cv::Mat image;

Then, we need to initialize and configure TesseractAPI.

  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init(NULL, "eng", tesseract::OEM_LSTM_ONLY);
  api->SetPageSegMode(tesseract::PSM_AUTO);
  api->SetVariable("debug_file", "tesseract.log");

Now, we need to go through each image inside the directory. To do that, we will use a C++17 feature: directory_iterator.

  for (const auto &fn : std::filesystem::directory_iterator(imagePath)) {

Inside the for loop, we will first get the path of the current entry and read the image from that file:

    auto filepath = fn.path();
    image = cv::imread(filepath.string(), cv::IMREAD_COLOR);

Then we will set the image to the TesseractAPI object.

    api->SetImage(image.data, image.cols, image.rows, 3, image.step);
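
One detail worth knowing here: cv::imread stores color pixels in BGR order, while Tesseract, as far as I can tell, reads 3-byte pixels as RGB. For black text on a white background it hardly matters, but if you want the channels to line up, the swap is easy to do by hand. The sketch below is only an illustration; swapRedBlue is a hypothetical helper of mine, not an OpenCV or Tesseract function. It also shows what the last two SetImage arguments mean: 3 bytes per pixel, and image.step bytes per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Swap the first and third byte of every pixel in place, turning
// BGR-ordered pixels into RGB-ordered ones (or back again).
// width and height are in pixels; step is the number of bytes per row,
// which may be larger than width * 3 when rows are padded.
void swapRedBlue(std::uint8_t *data, int width, int height, std::size_t step) {
  assert(data != nullptr && width >= 0 && height >= 0);
  assert(step >= static_cast<std::size_t>(width) * 3);
  for (int y = 0; y < height; ++y) {
    std::uint8_t *row = data + static_cast<std::size_t>(y) * step;
    for (int x = 0; x < width; ++x) {
      std::swap(row[x * 3], row[x * 3 + 2]);
    }
  }
}
```

In our loop, this would be called as swapRedBlue(image.data, image.cols, image.rows, image.step) right before SetImage. OpenCV’s own cv::cvtColor(image, image, cv::COLOR_BGR2RGB) does the same job.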

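Next, we will call GetUTF8Text() on the TesseractAPI object to extract the text from the image. The returned buffer belongs to us, so we copy it into a std::string, free it with delete[], and then print the result to the console.

    char *rawText = api->GetUTF8Text();
    std::string outText = rawText ? rawText : "";
    delete[] rawText;
    std::cout << outText << std::endl;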

Remember to shut down the API and delete the object after the for loop ends.

  }
  api->End();
  delete api;

Below is the complete main.cpp. I also added some lines to measure the inference time.

#include <fstream>
#include <iostream>
#include <string>
#include <filesystem>
#include <chrono>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <opencv2/opencv.hpp>

int main() {
  std::string imagePath = "../images";
  cv::Mat image;

  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init(NULL, "eng", tesseract::OEM_LSTM_ONLY);
  api->SetPageSegMode(tesseract::PSM_AUTO);
  api->SetVariable("debug_file", "tesseract.log");

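  for (const auto &fn : std::filesystem::directory_iterator(imagePath)) {
    auto start = std::chrono::steady_clock::now();
    auto filepath = fn.path();
    std::cout << "Detecting text in " << filepath << std::endl;

    // Read the file as a 3-channel color image; skip files OpenCV cannot decode.
    image = cv::imread(filepath.string(), cv::IMREAD_COLOR);
    if (image.empty()) {
      std::cerr << "Could not read " << filepath << ", skipping." << std::endl;
      continue;
    }

    api->SetImage(image.data, image.cols, image.rows, 3, image.step);

    // GetUTF8Text() returns a buffer that the caller must free with delete[].
    char *rawText = api->GetUTF8Text();
    std::string outText = rawText ? rawText : "";
    delete[] rawText;
    std::cout << outText << std::endl;

    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> diff = end - start;
    std::cout << "Computation time: " << diff.count() << "ms" << std::endl;
  }

  // Release Tesseract's resources, then the object itself.
  api->End();
  delete api;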
}
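
If you end up timing more than one step, the start/end lines can be pulled out into a small helper. Here is a minimal sketch, assuming a hypothetical function named elapsedMs (my own name, not part of the project). It uses the same steady_clock as the code above.

```cpp
#include <chrono>
#include <utility>

// Runs any callable once and returns the elapsed wall-clock time in
// milliseconds, measured with a monotonic clock.
template <typename Fn>
double elapsedMs(Fn &&fn) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}
```

For example, wrapping the SetImage and GetUTF8Text calls in a lambda and passing it to elapsedMs would give the inference time for one image.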

Let’s compile the code again. Whenever you update your code, you need to re-compile it. The good thing is that make only re-compiles the files that changed, which reduces the compilation time significantly.

make

Okay, in the project root path, let’s create a folder named images and put in some images that you want to extract text from. Note that the logic inside main.cpp is very basic, so you should only use images with crystal-clear black-on-white text, or you can just use mine.
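
Since the loop hands every directory entry to cv::imread, a stray text file or a subfolder inside images would come back as an empty image. Here is a minimal sketch of a filter that keeps only regular files with an image-like extension. The helper names hasImageExtension and listImageFiles and the extension list are my own assumptions, not part of the project above.

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: true if the extension looks like an image format.
// The list is an assumption; extend it to whatever formats you use.
bool hasImageExtension(const fs::path &p) {
  std::string ext = p.extension().string();
  std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return ext == ".png" || ext == ".jpg" || ext == ".jpeg" ||
         ext == ".tif" || ext == ".tiff" || ext == ".bmp";
}

// Collect regular files with an image extension, sorted so the order is
// stable (directory_iterator visits entries in an unspecified order).
std::vector<fs::path> listImageFiles(const fs::path &dir) {
  std::vector<fs::path> files;
  if (!fs::is_directory(dir)) {
    return files;  // missing folder: nothing to process
  }
  for (const auto &entry : fs::directory_iterator(dir)) {
    if (entry.is_regular_file() && hasImageExtension(entry.path())) {
      files.push_back(entry.path());
    }
  }
  std::sort(files.begin(), files.end());
  return files;
}
```

The loop in main.cpp could then iterate over listImageFiles("../images") instead of the raw directory_iterator.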

Now, let’s see the result, shall we?

cd build
./main
Figure 3: inference image 1
 anchors. An anchor is centered at the sliding window
 in question, and is associated with a scale and aspect
 ratio (Figure 3, left). By default we use 3 scales and
 3 aspect ratios, yielding k = 9 anchors at each sliding
 position. For a convolutional feature map of a size
 W x H (typically ~2,400), there are W Hk anchors in
 total.

 Translation-Invariant Anchors

 An important property of our approach is that it
 is translation invariant, both in terms of the anchors
 and the functions that compute proposals relative to
 the anchors. If one translates an object in an image,
 the proposal should translate and the same function
 should be able to predict the proposal in either lo-
 cation. This translation-invariant property is guaran-
 teed by our method’. As a comparison, the MultiBox
 method [27] uses k-means to generate 800 anchors,
 which are not translation invariant. So MultiBox does
 not guarantee that the same proposal is generated if
 an object is translated.

 The translation-invariant property also reduces the
 model size. MultiBox has a (4 + 1) x 800-dimensional
 fully-connected output layer, whereas our method has
 a (4 + 2) x 9-dimensional convolutional output layer
 in the case of k = 9 anchors. As a result, our output
 layer has 2.8 x 10? parameters (512 x (4 + 2) x 9

 Multi-Scale Anchors as Regression References

 Our design of anchors presents a novel scheme
 for addressing multiple scales (and aspect ratios). As
 shown in Figure 1, there have been two popular ways
 for multi-scale predictions. The first way is based on
 image/feature pyramids, e.g., in DPM [8] and CNN-
 based methods [9], [1], [2]. The images are resized at
 multiple scales, and feature maps (HOG [8] or deep
 convolutional features [9], [1], [2]) are computed for
 each scale (Figure 1(a)). This way is often useful but
 is time-consuming. The second way is to use sliding
 windows of multiple scales (and/or aspect ratios) on
 the feature maps. For example, in DPM [8], models
 of different aspect ratios are trained separately using
 different filter sizes (such as 5x7 and 7x5). If this way
 is used to address multiple scales, it can be thought
 of as a “pyramid of filters” (Figure 1(b)). The second
 way is usually adopted jointly with the first way [8].

 As a comparison, our anchor-based method is built
 on a pyramid of anchors, which is more cost-efficient.
 Our method classifies and regresses bounding boxes
 with reference to anchor boxes of multiple scales and
 aspect ratios. It only relies on images and feature
 maps of a single scale, and uses filters (sliding win-
 dows on the feature map) of a single size. We show by
 experiments the effects of this scheme for addressing
 multiple scales and sizes (Table 8).
Figure 4: inference image 2
 API examples

 This documentation provides simple examples on how to use the tesseract-ocr API (v3.02.02-4.0.0) in
 C++. It is expected that tesseract-ocr is correctly installed including all dependencies. It is expected
 the user is familiar with C++, compiling and linking program on their platform, though basic
 compilation examples are included for beginners with Linux.

 More details about tesseract-ocr API can be found at baseapi.h.

Oh my god. That was crazy, wasn’t it? I don’t want to dig too deep into how Tesseract recognizes text in an image, but clearly it’s done a good job, hasn’t it? The fact that it can even extract text from two-column material without any preprocessing is stunning!

Conclusion

So, in today’s post, we have gone through the process of installing OpenCV and Tesseract from source and using them to extract text from images. Furthermore, we did it all in old-school C++! Let’s forget the myth that C++ is impossible to learn. C++ is hard, I admit, but you don’t have to be a C++ guru to use it in your daily work. With a bit of practice, keep compiling and compiling and you will get the hang of it.

Thank you so much for reading. I will see you in the next post.

Trung Tran is a Deep Learning Engineer working in the car industry. His main daily job is to build deep learning models for autonomous driving projects, which range from 2D/3D object detection to road scene segmentation. After office hours, he works on his personal projects, which focus on Natural Language Processing and Reinforcement Learning. He loves to write technical blog posts, which helps spread his knowledge/experience to those who are struggling. Less pain, more gain.
