Tesseract OCR: very inaccurate result

Question

Below is my very simple program to test Tesseract performance. The result I got was not as expected though the picture was a high quality and very clear screenshot (not a complex picture with colors). Please take a look at my code and the result below. I'm not sure if I did something wrong or the Tesseract engine can not handle this?

enter image description here

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Drawing.Imaging;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using tessnet2;

namespace ImageProcessTesting
{
public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void button1_Click(object sender, EventArgs e)
    {
        int up_lef_x = 1075;
        int up_lef_y = 0070;

        int bo_rig_x = 1430;
        int bo_rig_y = 0095;

        int width = bo_rig_x - up_lef_x;
        int height = bo_rig_y - up_lef_y;

        var bmpScreenshot = new Bitmap(width, height, PixelFormat.Format32bppArgb);
        var gfxScreenshot = Graphics.FromImage(bmpScreenshot);

        gfxScreenshot.CopyFromScreen(
                                    1075,
                                    0070,
                                    0,
                                    0,
                                    Screen.PrimaryScreen.Bounds.Size,
                                    CopyPixelOperation.SourceCopy);

        // bmpScreenshot.Save("C:\Users\Exa\Screenshot.png", ImageFormat.Png);


        var image = bmpScreenshot;
        var ocr = new Tesseract();
        ocr.Init(@"C:\Users\Exa\Desktop	essdata", "eng", false);
        var result = ocr.DoOCR(image, Rectangle.Empty);
        string result_str = "";
        foreach (Word word in result)
            result_str += word.Text;
        MessageBox.Show(result_str);

    }
}
}

Jamesits · Accepted Answer

96DPI screen shots are typically not adequate for OCR. As written in Tesseract wiki:

There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".

However, if you know what exact font it is, you can try re-train tesseract to get better result.

Tesseract OCR: very inaccurate result

Tags:

c#

tesseract

Louis Tran

1 Answers

Jamesits

Recent Activity

Donate For Us

Tesseract OCR: very inaccurate result

Tags:

c#

tesseract

Louis Tran

1 Answers

Jamesits

Related questions

Recent Activity

Donate For Us