I am trying to extract text from an image file using Tesseract OCR in Python, but I am facing an error that I can't figure out how to deal with. My environment seems fine, as I tested some sample images with the OCR in Python.
Here is the code:
from PIL import Image
import pytesseract
strs = pytesseract.image_to_string(Image.open('binarized_image.png'))
print (strs)
The following is the error I get from the Eclipse console:
strs = pytesseract.image_to_string(Image.open('binarized_body.png'))
File "C:\Python35x64\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
return f.read().strip()
File "C:\Python35x64\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 20: character maps to <undefined>
I am using Python 3.5 x64 on Windows 10.
The problem is that Python is reading Tesseract's output with the Windows default encoding (CP1252) instead of the encoding the output is actually written in (UTF-8). PyTesseract has hit a non-ASCII character, and one of its UTF-8 bytes (0x9d) has no mapping in CP1252, so decoding fails. On a platform where the default encoding is UTF-8 you won't encounter this error.
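To see the failure in isolation, here is a minimal sketch that does not involve Tesseract at all. It assumes the offending character is a right double quotation mark (U+201D), which is only a guess based on the 0x9d byte in the traceback; any character whose UTF-8 encoding contains 0x9d would behave the same way. The helper name `try_decode` is made up for illustration.

```python
# UTF-8 bytes for a right double quotation mark (U+201D): e2 80 9d.
# Assumption: this stands in for whatever character Tesseract emitted.
utf8_bytes = "\u201d".encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: {}".format(exc)


# Decoding with the encoding the bytes were written in works.
print(ascii(try_decode(utf8_bytes, "utf-8")))

# Decoding the same bytes as CP1252 fails on byte 0x9d,
# which CP1252 leaves undefined.
print(try_decode(utf8_bytes, "cp1252"))
```

The second call reports the same kind of 'charmap' error as the traceback in the question, which suggests the bytes themselves are fine and only the choice of decoder is wrong.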
You can try using a different function (possibly one that returns bytes instead of str, so you won't have to worry about encoding). You could change the default encoding of Python as mentioned in one of the comments, although that will cause problems when you try to print the string on the Windows console. Or, and this is my recommended solution, you could download Cygwin and run Python on that to get clean UTF-8 output.
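The "bytes instead of str" idea can be sketched without pytesseract: read the file in binary mode and decode it yourself as UTF-8. The helper `read_utf8_text` and the sample content are hypothetical; the temporary file merely stands in for the text file that Tesseract writes.

```python
import os
import tempfile


def read_utf8_text(path):
    """Read a file as raw bytes and decode it explicitly as UTF-8,
    so the platform default encoding (e.g. CP1252) is never consulted."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8").strip()


# Stand-in for Tesseract's output file (hypothetical content).
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("\u201cquoted\u201d text\n".encode("utf-8"))
    tmp_path = tmp.name

try:
    text = read_utf8_text(tmp_path)
finally:
    os.remove(tmp_path)

# ascii() escapes anything the console may not be able to display.
print(ascii(text))
```

Decoding explicitly keeps the result independent of the machine's locale; the remaining risk is only at print time, which is why the sketch escapes the text before writing it to the console.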
If you want a quick and dirty solution that won't break anything (yet), here's a way that you might consider:
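import builtins
original_open = open

def bin_open(filename, mode='rb'):  # note, the default mode now opens in binary
    return original_open(filename, mode)

from PIL import Image
import pytesseract

img = Image.open('binarized_image.png')
try:
    # Temporarily patch open() so pytesseract reads its output file as bytes
    builtins.open = bin_open
    bts = pytesseract.image_to_string(img)
finally:
    # Always restore the real open(), even if OCR raises
    builtins.open = original_open

# Decode for the Windows console, dropping bytes CP1252 cannot map
print(str(bts, 'cp1252', 'ignore'))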