Tesseract OCR Text Position

I am working on OCR using Tesseract. The application works and produces output. I am trying to extract data from an invoice, and the spacing between words in the output file has to match the spacing in the input. I can currently get each word and its coordinates. I need to export the words to a text file, positioned according to those coordinates.

Code sample:

            using (var engine = new TesseractEngine(Server.MapPath(@"~/tessdata"), "eng", EngineMode.Default))
            {
                engine.DefaultPageSegMode = PageSegMode.AutoOsd;
                // have to load Pix via a bitmap since Pix doesn't support loading a stream.

                using (var image = new System.Drawing.Bitmap(imageFile.PostedFile.InputStream))
                {

                    Bitmap bmp = Resize(image, 1920, 1080);

                    using (var pix = PixConverter.ToPix(image))
                    {
                        using (var page = engine.Process(pix))
                        {
                            using (var iter = page.GetIterator())
                            {
                                iter.Begin();
                                do
                                {
                                    Rect symbolBounds;
                                    string path = Server.MapPath("~/Output/data.txt");
                                    if (iter.TryGetBoundingBox(PageIteratorLevel.Word, out symbolBounds))
                                    {
                                        // symbolBounds holds the location of the word; curText holds the word itself.
                                        var curText = iter.GetText(PageIteratorLevel.Word);

                                        //WriteToTextFile(curText, symbolBounds, path);
                                        resultText.InnerText += curText;
                                    }
                                } while (iter.Next(PageIteratorLevel.Word));
                            }


                            meanConfidenceLabel.InnerText = String.Format("{0:P}", page.GetMeanConfidence());

                        }
                    }
                }
            }

Here is an example of input and output showing the wrong spacing.

[Images: Input, Output]

asked Jul 11 '18 09:07 by ab2015


1 Answer

You can loop through the items found on the page using page.GetIterator(). For each item you can get a bounding box: a Tesseract.Rect (a rectangle struct) that holds the X1, Y1, X2, and Y2 coordinates.

Tesseract.PageIteratorLevel myLevel = /*TODO*/;
using (var page = Engine.Process(img))
using (var iter = page.GetIterator())
{
    iter.Begin();
    do
    {
        if (iter.TryGetBoundingBox(myLevel, out var rect))
        {
            var curText = iter.GetText(myLevel);
            // Your code here: 'rect' contains the location of the text, 'curText' contains the text itself
        }
    } while (iter.Next(myLevel));
}

There is no clear-cut way to use the positions in the input to space the text in the output. You're going to have to write some custom logic for that.
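
As a starting point for that custom logic, the words usually need to be grouped into visual lines before any horizontal spacing can be applied. The following is a minimal sketch, not a definitive implementation. `WordBox` and `LineGrouper` are hypothetical names that are not part of the Tesseract wrapper; a `WordBox` would be filled from `curText` and `rect` inside the loop above. The half-height tolerance is an assumption that may need tuning for skewed scans.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one OCR word (text plus bounding box in pixels).
public sealed class WordBox
{
    public string Text { get; }
    public int X1 { get; }
    public int Y1 { get; }
    public int X2 { get; }
    public int Y2 { get; }

    public WordBox(string text, int x1, int y1, int x2, int y2)
    {
        Text = text; X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }

    public double CenterY => (Y1 + Y2) / 2.0;
    public int Height => Y2 - Y1;
}

public static class LineGrouper
{
    // Groups words into visual lines. A word joins the current line when its
    // vertical centre is within half the height of the line's first word,
    // measured from that first word's centre (assumed tolerance).
    public static List<List<WordBox>> GroupIntoLines(IEnumerable<WordBox> words)
    {
        var lines = new List<List<WordBox>>();
        List<WordBox> current = null;
        double lineCenter = 0;
        double lineHeight = 0;

        // Process words from top to bottom.
        foreach (var word in words.OrderBy(w => w.CenterY))
        {
            if (current != null && Math.Abs(word.CenterY - lineCenter) <= lineHeight / 2.0)
            {
                current.Add(word);
            }
            else
            {
                current = new List<WordBox> { word };
                lines.Add(current);
                lineCenter = word.CenterY;
                lineHeight = word.Height;
            }
        }

        // Order the words in each line from left to right.
        foreach (var line in lines)
        {
            line.Sort((a, b) => a.X1.CompareTo(b.X1));
        }
        return lines;
    }
}
```

Each inner list can then be written out as one line of the text file, with the horizontal spacing derived from each word's X1.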

You might be able to estimate the number of spaces you need to the left of your text with something like this:

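// Cast to double first: rect.X1 is an int, so if inputWidth is also an int the division would truncate to 0.
var padLeftSpaces = (int)Math.Round(((double)rect.X1 / inputWidth) * outputWidthSpaces);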
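
The same estimate can be applied to every word on a line to build a padded output string. This is a rough sketch under stated assumptions: `LineRenderer`, `ToColumn`, and `RenderLine` are hypothetical helpers, the words are assumed to be already sorted left to right, and the output is assumed to be viewed in a monospaced font. `inputWidth` is the image width in pixels and `outputWidthSpaces` is the desired line width in characters, as in the formula above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineRenderer
{
    // Maps a pixel x-coordinate to a character column.
    // The cast to double avoids integer division truncating the ratio to 0.
    public static int ToColumn(int x, int inputWidth, int outputWidthSpaces)
    {
        return (int)Math.Round((double)x / inputWidth * outputWidthSpaces);
    }

    // Builds one output line from (left edge in pixels, text) pairs,
    // padding with spaces so each word starts near its scaled column.
    public static string RenderLine(
        IEnumerable<(int X1, string Text)> words, int inputWidth, int outputWidthSpaces)
    {
        var sb = new StringBuilder();
        foreach (var word in words)
        {
            int column = ToColumn(word.X1, inputWidth, outputWidthSpaces);

            // If the previous word already reached this column,
            // fall back to a single separating space.
            if (sb.Length > 0 && column <= sb.Length)
            {
                column = sb.Length + 1;
            }

            sb.Append(' ', column - sb.Length);
            sb.Append(word.Text);
        }
        return sb.ToString();
    }
}
```

For example, `LineRenderer.RenderLine(new[] { (0, "Invoice"), (500, "Total") }, 1000, 100)` places "Total" at column 50. Proportional fonts in the scanned image mean the columns are only approximate, so narrow gaps between words may collapse to a single space.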
answered Sep 22 '22 11:09 by GWigWam