UnicodeDecodeError with Tesseract OCR in Python

I am trying to extract text from an image file using Tesseract OCR in Python, but I am hitting an error that I can't figure out how to deal with. My environment seems fine, since I tested the OCR on some sample images in Python and it worked.

Here is the code:

from PIL import Image
import pytesseract
strs = pytesseract.image_to_string(Image.open('binarized_image.png'))

print(strs)

The following is the error I get in the Eclipse console:

strs = pytesseract.image_to_string(Image.open('binarized_body.png'))
  File "C:\Python35x64\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
    return f.read().strip()
  File "C:\Python35x64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 20: character maps to <undefined>

I am using Python 3.5 x64 on Windows 10.

asked Dec 15 '15 15:12

Nwawel A Iroume

1 Answer

The problem is that Python is reading Tesseract's output with the Windows default encoding (CP1252) instead of the encoding it should use (UTF-8). pytesseract has found a non-ASCII character, and the UTF-8 bytes for it include 0x9d, which CP1252 cannot decode. On a platform where the default encoding is UTF-8 you won't encounter this error.
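
The failure can be reproduced without Tesseract. Here is a minimal sketch, assuming the offending character was a right double quotation mark (U+201D), whose UTF-8 encoding ends in the byte 0x9d. The traceback does not say which character the image actually contained, so that choice is only an illustration:

```python
# Assumption: the OCR output contained U+201D, one of several characters
# whose UTF-8 encoding includes the byte 0x9d.
utf8_bytes = "\u201d".encode("utf-8")   # three bytes: e2 80 9d

# Decoding as UTF-8 round-trips cleanly.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoding as CP1252 fails: 0x9d is one of the few bytes CP1252 leaves undefined.
try:
    utf8_bytes.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print(exc)
```

The exception raised here has the same form as the one in the question, only with a different byte position.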

You have a few options. You can try a different function, ideally one that returns bytes instead of str, so that you control the decoding yourself. You can change Python's default encoding, as mentioned in one of the comments, although that will cause problems when you print the string to the Windows console. Or, and this is my recommended solution, you can install Cygwin and run Python there to get clean UTF-8 output.
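
The bytes-first idea can be sketched as follows. This is an illustration only: `read_ocr_output` and `console_safe` are hypothetical helpers, not part of pytesseract, and the sketch assumes you have the path of a text file that Tesseract wrote and that the file is UTF-8. A temporary stand-in file is used for the demo.

```python
import os
import sys
import tempfile

def read_ocr_output(path):
    """Hypothetical helper: read an OCR output file as bytes, then decode as UTF-8."""
    with open(path, "rb") as f:          # binary mode: no implicit locale decoding
        raw = f.read()
    return raw.decode("utf-8", errors="replace").strip()

def console_safe(text, encoding=None):
    """Replace characters that the target console encoding cannot represent."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)

# Demo with a stand-in file, since no real OCR output is available here.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("\u201cquoted\u201d text\n".encode("utf-8"))
demo_text = read_ocr_output(tmp.name)
os.remove(tmp.name)

print(console_safe(demo_text))
```

Separating the two steps keeps the decoding (always UTF-8) independent of whatever the console happens to support, which is where changing the default encoding runs into trouble.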

If you want a quick and dirty solution that won't break anything (yet), here's a way that you might consider:

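import builtins

from PIL import Image
import pytesseract

original_open = open

def bin_open(filename, mode='rb', *args, **kwargs):   # note, the default mode now opens in binary
    return original_open(filename, mode, *args, **kwargs)

img = Image.open('binarized_image.png')

try:
    # temporarily make pytesseract read Tesseract's output file as raw bytes
    builtins.open = bin_open
    bts = pytesseract.image_to_string(img)
finally:
    # always restore the real open(), even if OCR fails
    builtins.open = original_open

# decoding as cp1252 with 'ignore' never raises, but non-ASCII characters come out garbled;
# use bts.decode('utf-8') instead if your console can display them
print(str(bts, 'cp1252', 'ignore'))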

answered Oct 30 '22 13:10

randomusername

Tags: python, tesseract, python-tesseract
