I am trying to extract data from a picture with OCR. I use Tesseract API in C++ to achieve this.
The original picture is this:
Now the for me important data is this:
However the marked blue line is never recognized does not matter what I try.
The code to analyze the picture with tesseract looks like this:
std::string readFromFile(const std::string& filename)
{
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
api->SetPageSegMode(tesseract::PSM_AUTO);
if (api->Init("folder_to_tessdata", "deu+eng")) {
fprintf(stderr, "Could not initialize tesseract.\n");
exit(1);
}
// Open input image with leptonica library
Pix *image = pixRead(filename.c_str());
api->SetImage(image);
// Get OCR result
char *outText = api->GetUTF8Text();
std::string result{ outText };
api->End();
delete[] outText;
pixDestroy(&image);
return result;
}
I tryed to improve the accuracy by preprocessing the image like it is suggested in this question: image processing to improve tesseract OCR accuracy
The Code for the preprocessing:
cv::Mat image;
image = cv::imread(filename, cv::IMREAD_COLOR);
cv::resize(image, image, cv::Size{}, 1.2, 1.2, cv::INTER_CUBIC);
cv::cvtColor(image, image, cv::COLOR_BGR2GRAY);
auto kernel = cv::Mat(1, 1, CV_8UC1, cv::Scalar(1));
cv::dilate(image, image, kernel);
cv::erode(image, image, kernel);
cv::Mat filter;
cv::bilateralFilter(image, filter, 5, 75, 75);
cv::threshold(filter, image, 0, 255, cv::THRESH_BINARY + cv::THRESH_OTSU);
Am I missing something? Can I tweak Tesseract itself more or should I change the preprocessing of the image?
Inevitably, noise in an input image, non-standard fonts that Tesseract wasn't trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text. When that happens, you need to create rules and heuristics that can be used to improve the output OCR quality.
The following results are presented for Tesseract: the original set of samples achieves a precision of 0.907 and 0.901 recall rate, while the preprocessed set leads to a precision of 0.929 and a recall of 0.928.
Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.
My reference is here.
Note: You don't need to deal with preprocess steps because it seems you already have a pure image. It doesn't have noises much.
My environment information:
Operating system: Ubuntu 16.04
Tesseract version by the command of tesseract --version
:
tesseract 4.1.1-rc2-21-gf4ef
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.2
Found AVX
Found SSE
Found libarchive 3.1.2
OpenCV Version by the command of pkg-config --modversion opencv
:
3.4.3
Difference: When I checked your code, I have only seen the clear difference with this one. You are opening the image with leptonica library one more time instead of opencv.
Here is the code and resulted output:
Input:
Output texts:
Al AQ A3 Ad AS A6 Al A8
| 2 3 4 5 6 7 8
WH GN YE GY PK Bu RD VT
K101 K102 K103 K104 K105 K107 K109 K110
Q30,0 Q30.1 Q30.2 Q30.3 Q30.4 Q30.5 Q30.6 Q30.7
=13/L.2 =13/2.2 =13/4.2 =13/6.2 =13/7.2 =13/10.2 FIBL.2 = 1312.2
AS AlO All Al2 AL3 Al4 ALS AL6
9 10 ll 12 13 14 15 16
GY /PK RD/BU WH/GN BN/GN WH/YE YE/BN WH/GY GY/BN
Kl1l K112 y114 K115 K117 K118 K124
Q31,0 Q31.1 Q31.2 Q31.3 Q31.4 Q31.5 Q31.6 Q31.7
=13/13.2 =13/14.2 =13/15.2 =13/16.2 =1B7.2 PIB. =13/21.2
Beckhoff KL 2809
Code:
#include <string>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;
int main(int argc, char* argv[])
{
string outText;
// Create Tesseract object
tesseract::TessBaseAPI *ocr = new tesseract::TessBaseAPI();
ocr->Init(NULL, "eng", tesseract::OEM_LSTM_ONLY);
// Set Page segmentation mode to PSM_AUTO (3)
ocr->SetPageSegMode(tesseract::PSM_AUTO);
// Open input image using OpenCV
Mat im = cv::imread("/ur/image/directory/tessatest.png", IMREAD_COLOR);
// Set image data
ocr->SetImage(im.data, im.cols, im.rows, 3, im.step);
// Run Tesseract OCR on image
outText = string(ocr->GetUTF8Text());
// print recognized text
cout << outText << endl;
// Destroy used object and release memory
ocr->End();
return EXIT_SUCCESS;
}
The compilation of the code:
g++ -O3 -std=c++11 test.cpp -o output `pkg-config --cflags --libs tesseract opencv`
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With