
Chinese character recognition using Tesseract OCR

I have been using the Tesseract 3.0.2 OCR SDK to extract text from images. When I pass in images that contain Chinese text, Tesseract does not return Chinese characters; I get digits and English letters instead. I need the Chinese characters as they appear in the image.

How can I achieve this? Is there any way I can obtain Chinese characters rather than any other characters?

Nishant Tyagi asked May 16 '13 07:05




1 Answer

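You need to download the Chinese trained data (a file such as chi_sim.traineddata for Simplified Chinese, or chi_tra.traineddata for Traditional Chinese) and add it to your tessdata folder.

You can download the file from https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata

Then pass the language code when you initialize Tesseract, like this: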

Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"chi_sim"];
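
If the output still contains no Chinese characters, one possible cause is that the trained-data file is missing from the tessdata folder or is misnamed. The following is a minimal sketch of a check you could run on your development machine before bundling the folder into the app. It is written in Python rather than Objective-C, the function name missing_traineddata is hypothetical, and it assumes the usual `<code>.traineddata` naming plus Tesseract's convention of joining several languages with "+" (for example "chi_sim+eng").

```python
import os
import tempfile


def missing_traineddata(tessdata_dir, language):
    """Return the .traineddata file names that `language` needs but tessdata_dir lacks."""
    # Assumption: several languages may be joined with "+", e.g. "chi_sim+eng".
    codes = [code for code in language.split("+") if code]
    expected = [code + ".traineddata" for code in codes]
    return [name for name in expected
            if not os.path.isfile(os.path.join(tessdata_dir, name))]


if __name__ == "__main__":
    # Demo against a throwaway folder standing in for the project's tessdata directory.
    with tempfile.TemporaryDirectory() as tessdata:
        print(missing_traineddata(tessdata, "chi_sim"))  # file not there yet
        open(os.path.join(tessdata, "chi_sim.traineddata"), "wb").close()
        print(missing_traineddata(tessdata, "chi_sim"))  # nothing missing now
```

In a real project you would call missing_traineddata with the path of the tessdata folder you pass to initWithDataPath: and the same language string; an empty list means every expected file is present.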

If you run into any problems, you can download my Tesseract experiment (with Chinese language support) from https://github.com/aryansbtloe/ExperimentWithTesseract.git

I have tested this one. Hope you find it useful.

Alok Singh answered Oct 02 '22 22:10
