Get font of recognized character with Tesseract-OCR

Tags:

tesseract

Is it possible to get the font of the recognized characters with Tesseract-OCR, i.e. whether they are Arial or Times New Roman, either from the command line or using the API?

I'm scanning documents that might have different parts with different fonts, and it would be useful to have this information.

sashoalm asked Mar 28 '13 10:03




1 Answer

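The Tesseract API has a WordFontAttributes function that you can call on a ResultIterator (it is declared in the base class, LTRResultIterator). For the current word it returns the font name and reports whether the font is bold, italic, underlined, monospace, serif or small caps, along with the point size and a font id.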
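To illustrate, here is a minimal sketch in Python. It assumes the third-party tesserocr binding, which exposes the same iterator API; the dictionary keys used below (font_name, pointsize, bold, italic) are what I believe that binding returns, so check them against your installed version. The helper names describe_font and iter_word_fonts are made up for this example. As far as I know, font attributes are only filled in by the legacy engine, so the sketch requests it explicitly, which requires traineddata that still contains the legacy model.

```python
from typing import Iterator, Optional, Tuple


def describe_font(attrs: Optional[dict]) -> str:
    """Turn a WordFontAttributes-style dict into a short label.

    The key names are an assumption about what the binding returns;
    adjust them if yours differs.
    """
    if not attrs or not attrs.get("font_name"):
        return "unknown"
    label = attrs["font_name"]
    if attrs.get("pointsize"):
        label += f" {attrs['pointsize']}pt"
    styles = [name for name in ("bold", "italic") if attrs.get(name)]
    if styles:
        label += " (" + ", ".join(styles) + ")"
    return label


def iter_word_fonts(image_path: str, lang: str = "eng") -> Iterator[Tuple[str, str]]:
    """Yield (word, font label) pairs for each recognized word in the image."""
    # Imported lazily so describe_font stays usable without tesserocr installed.
    from tesserocr import OEM, RIL, PyTessBaseAPI, iterate_level

    # Assumption: the legacy engine is the one that reports font attributes.
    with PyTessBaseAPI(lang=lang, oem=OEM.TESSERACT_ONLY) as api:
        api.SetImageFile(image_path)
        api.Recognize()
        for word in iterate_level(api.GetIterator(), RIL.WORD):
            text = word.GetUTF8Text(RIL.WORD)
            yield text, describe_font(word.WordFontAttributes())
```

You would then loop over `iter_word_fonts("page.png")` (the file name is a placeholder) and group consecutive words by label to find the parts of the document set in different fonts. Note that the reported name is the closest font the engine was trained on, not necessarily the exact typeface used in the scan.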

nguyenq answered Sep 29 '22 10:09