Tesseract OCR simple example

Tags: c#, ocr, tesseract

Hi. Can anyone give me a simple example of testing Tesseract OCR, preferably in C#?
I tried the demo found here. I downloaded the English dataset, unzipped it to the C: drive, and modified the code as follows:

    string path = @"C:\pic\mytext.jpg";
    Bitmap image = new Bitmap(path);
    Tesseract ocr = new Tesseract();
    ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // If digit only
    ocr.Init(@"C:\tessdata\", "eng", false); // To use correct tessdata
    List<tessnet2.Word> result = ocr.DoOCR(image, Rectangle.Empty);
    foreach (tessnet2.Word word in result)
        Console.WriteLine("{0} : {1}", word.Confidence, word.Text);

Unfortunately, the code doesn't work: the program dies at the "ocr.Init(..." line. I couldn't even catch an exception using try-catch.

I was able to run VietOCR, but that is a very large project for me to follow. I need a simple example like the one above.

asked May 16 '13 22:05 by Will Robinson



2 Answers

OK, I found the solution here: the answer given by Adam to "tessnet2 fails to load".

Apparently I was using the wrong version of tessdata. I was following the instructions on the source page intuitively, and that caused the problem.

It says:

Quick Tessnet2 usage

  1. Download binary here, add a reference of the assembly Tessnet2.dll to your .NET project.

  2. Download language data definition file here and put it in tessdata directory. Tessdata directory and your exe must be in the same directory.

After you download the binary and follow the link to download the language file, you see many language files, but none of them is the right version. You need to select "all versions" and go to the next page to find the correct one (tesseract-2.00.eng). They should either update the binary download link to version 3 or put the version 2 language file on the first page, or at least mention in bold that this version mismatch is a big deal.
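
One way to catch this kind of mismatch before calling Init is to look at which files are actually in the tessdata folder. The sketch below is only an illustration and is not part of tessnet2: the class and method names are hypothetical, and it assumes that version 2 language data is a set of files such as eng.inttemp and eng.unicharset, while version 3 data is a single eng.traineddata file.

```csharp
using System;
using System.IO;

public static class TessdataCheck
{
    // Hypothetical helper, not a tessnet2 API.
    // Returns "v2", "v3", or "none" depending on which language files are found.
    // Assumption: Tesseract 2 data includes <lang>.inttemp, while Tesseract 3
    // data ships as a single <lang>.traineddata file.
    public static string DetectDataVersion(string tessdataDir, string lang)
    {
        if (!Directory.Exists(tessdataDir))
            return "none";

        if (File.Exists(Path.Combine(tessdataDir, lang + ".inttemp")))
            return "v2";

        if (File.Exists(Path.Combine(tessdataDir, lang + ".traineddata")))
            return "v3";

        return "none";
    }
}
```

For the setup in the question, calling TessdataCheck.DetectDataVersion(@"C:\tessdata", "eng") and getting anything other than "v2" would suggest that the folder does not hold the data tessnet2 expects.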

Anyway I found it. Thanks everyone.

answered Sep 20 '22 07:09 by Will Robinson


A simple example of testing Tesseract OCR in C#:

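    // Requires: using System.Drawing; using Tesseract;
    public static string GetText(Bitmap imgsource)
    {
        var ocrtext = string.Empty;
        // Load the English language data from the tessdata folder next to the executable.
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        {
            // Convert the Bitmap to the Pix format the engine works with.
            using (var img = PixConverter.ToPix(imgsource))
            {
                using (var page = engine.Process(img))
                {
                    ocrtext = page.GetText();
                }
            }
        }

        return ocrtext;
    }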

Info: The tessdata folder must exist in the output directory (for example bin\Debug\), because the engine is given the relative path ./tessdata.

answered Sep 23 '22 07:09 by Adolfo Alejandro Araya