I have been using the Tesseract 3.0.2 OCR SDK to extract text from images. When I pass an image containing Chinese text through the OCR, Tesseract does not return Chinese characters; I get digits and English letters instead. I need the Chinese characters exactly as they appear in the image.
How can I achieve this? Is there a way to get Tesseract to output Chinese characters rather than other characters?
Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.
You need to download the Chinese trained data (a file such as chi_sim.traineddata) and add it to your tessdata folder.
You can download the file from https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata
Then use it like this:
Tesseract* tesseract= [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"chi_sim"];
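The download step above can be sketched as a small shell helper. This is only an illustration: the function names `traineddata_url` and `fetch_traineddata` are made up for this example, it assumes `curl` is installed, and it uses the URL given in this answer without checking whether the file on that branch matches Tesseract 3.0.2.

```shell
#!/bin/sh
# Base URL taken from the answer; the language code is the only variable part.
TESSDATA_BASE_URL="https://github.com/tesseract-ocr/tessdata/raw/master"

# Hypothetical helper: print the download URL for a language code such as chi_sim.
traineddata_url() {
    printf '%s/%s.traineddata\n' "$TESSDATA_BASE_URL" "$1"
}

# Hypothetical helper: download <lang>.traineddata into a tessdata directory.
# Usage: fetch_traineddata chi_sim ./tessdata
fetch_traineddata() {
    lang="$1"
    dest_dir="${2:-tessdata}"
    mkdir -p "$dest_dir" || return 1
    # -f fails on HTTP errors, -L follows the redirect to the raw file.
    curl -fL -o "$dest_dir/$lang.traineddata" "$(traineddata_url "$lang")"
}
```

Running `fetch_traineddata chi_sim ./tessdata` should leave `./tessdata/chi_sim.traineddata` on disk, which is the folder name passed to `initWithDataPath:` in the line above.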
If you have any problems, you can download my experiment with Tesseract (with Chinese language support) from https://github.com/aryansbtloe/ExperimentWithTesseract.git
I have tested this. I hope you find it useful.