
Preserving Spaces in Tesseract

I have an image file that contains text in columns separated by tabs (2 spaces). But when I extract the text from this image, I always get a single space between the columns. For example:

IMAGE:

col-a    col-b    col-c

Desired output:

col-a    col-b    col-c

But I am getting the following:

col-a col-b col-c

I am using pytesseract.image_to_string (Python module) to convert the image to text.

raghu asked Aug 03 '18 08:08




1 Answer

Set Tesseract's preserve_interword_spaces option through the config argument:

pytesseract.image_to_string(img, config='-c preserve_interword_spaces=1')
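A slightly fuller sketch of the same idea, in case it helps. The helper names (build_config, split_columns, ocr_preserving_spaces) are made up for illustration and are not part of pytesseract. The `--psm 6` page segmentation mode (treat the image as a single uniform block of text) is my own assumption about what suits a column layout; it is not required by the option above, and results may vary with the image and the Tesseract version.

```python
import re


def build_config(preserve_spaces=True, psm=None):
    """Build a Tesseract config string (hypothetical helper, not a pytesseract API)."""
    parts = []
    if psm is not None:
        # Page segmentation mode; 6 = assume a single uniform block of text.
        parts.append(f"--psm {psm}")
    if preserve_spaces:
        # The option from the answer: keep runs of spaces between words.
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


def split_columns(line):
    """Split one OCR'd line into columns on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


def ocr_preserving_spaces(image_path, psm=6):
    """Run OCR on an image file while keeping inter-word spacing."""
    # Imported here so the helpers above work without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, config=build_config(psm=psm))
```

With the spacing preserved, each line returned by ocr_preserving_spaces can be passed to split_columns to recover the individual columns, assuming the columns are separated by at least two spaces in the OCR output.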
Rajesh Subbiah answered Nov 03 '22 00:11
