Opening Image file from url with PIL for text recognition with pytesseract

I am facing a confusing problem when trying to download an image and open it with BytesIO in order to extract text from it using PIL and pytesseract.

>>> response = requests.get('http://abc/images/im.jpg')
>>> img = Image.open(BytesIO(response.content))
>>> img
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=217x16 at 0x7FDAD185CB38>
>>> text = pytesseract.image_to_string(img)
>>> text
''

Here it gives an empty string.

However, if I save the image, open it again with PIL, and pass it to pytesseract, it gives the right result.

>>> img.save('im1.jpg')
>>> im = Image.open('im1.jpg')
>>> pytesseract.image_to_string(im)
'The right text'

And just to confirm, both images report the same size.

>>> im.size
(217, 16)
>>> img.size
(217, 16)

What can be the problem? Is it necessary to save the image or am I doing something wrong?

Asked by sprksh, Apr 13 '17 23:04


1 Answer

You seem to have a problem which I can't reproduce. Diagnosing it, if there is a problem at all, would require many more details. BUT instead of asking for them, I just assume (based on my overall experience) that the problem will vanish while you gather those details and that you won't be able to reproduce it either. In that sense, this answer is a solution to your problem.

In case it is not, let me know if you need further assistance. At least you can be sure that your approach is generally right: it matches what you experienced with the saved file, and you did nothing apparently wrong.
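If the problem does persist, one way to gather the missing details is to compare the downloaded image with the saved-and-reopened one. The following is only a minimal sketch: `describe_difference` is a hypothetical helper name, not part of PIL or pytesseract, and the `dpi` entry is present only when the file carries that metadata.

```python
from PIL import Image, ImageChops


def describe_difference(a, b):
    """Summarise how two PIL images differ (hypothetical helper)."""
    report = {
        "format": (a.format, b.format),
        "mode": (a.mode, b.mode),
        "size": (a.size, b.size),
        # "dpi" is optional metadata; .get() returns None when it is absent
        "dpi": (a.info.get("dpi"), b.info.get("dpi")),
    }
    if a.size == b.size:
        # Compare in RGB so that both images share a mode without alpha.
        diff = ImageChops.difference(a.convert("RGB"), b.convert("RGB"))
        # getbbox() returns None when every pixel of the difference is zero.
        report["pixels_identical"] = diff.getbbox() is None
    else:
        report["pixels_identical"] = False
    return report
```

Calling `describe_difference(img, im)` with the two images from the question would show whether they differ in anything other than size. Since saving as JPEG re-encodes the picture, the pixel data will probably not be identical.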

Here is the FULL code (your question doesn't say which modules are necessary). The image is actually ONLINE, so anyone else can test whether the code works or not (your question didn't provide a publicly available image):

import io

import pytesseract
import requests
from PIL import Image

# Download the image; response is a requests.models.Response
response = requests.get("http://www.teamjimmyjoe.com/wp-content/uploads/2014/09/Classic-Best-Funny-Text-Messages-earthquake-titties.jpg")
# Decode the bytes in memory; img is a PIL.JpegImagePlugin.JpegImageFile
img = Image.open(io.BytesIO(response.content))
# Run OCR directly on the in-memory image
text = pytesseract.image_to_string(img)
print(text)

Here is the pytesseract output:

Hey! I just saw on CNN
there was an earthquake
near you. Are you ok?






‘ Yes! We‘re all line!

What did it rate on the titty
scale?
‘ Well they only jiggled a

little bit, so probably not

that high.
HAHAHAHAHAHA I LOVE
YOU
Richter scale. My phone is l
a 12 yr old boy.

My system: Linux Mint 18.1 with Python 3.6
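As for whether saving to disk is necessary: if it turns out that re-encoding the image is what makes the difference on your system (an assumption, since I could not reproduce the problem), the same round trip can be done in memory. A minimal sketch, where `reencode_in_memory` is a hypothetical helper name:

```python
import io

from PIL import Image


def reencode_in_memory(img, fmt="JPEG"):
    """Save img into a buffer and reopen it, mimicking save + Image.open."""
    buf = io.BytesIO()
    img.save(buf, format=fmt)  # same encoder that img.save('im1.jpg') would use
    buf.seek(0)
    reopened = Image.open(buf)
    reopened.load()  # decode now, so the result does not depend on buf later
    return reopened
```

The result can then be passed on as in the question, for example `pytesseract.image_to_string(reencode_in_memory(img))`, without creating a file.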

Answered by Claudio, Nov 03 '22 01:11