I am trying to read all PDF files in a folder and search them for a number with a regular expression. On inspection, the charset of the PDFs is 'UTF-8'. My code throws this error:
'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
I tried reading in binary mode, and I tried Latin-1 encoding, but then the text is full of special characters and nothing shows up in the search.
import os
import re

download_file_path = "C:\\Users\\...\\..\\"
for file_name in os.listdir(download_file_path):
    try:
        with open(os.path.join(download_file_path, file_name), 'r', encoding="UTF-8") as f:
            s = f.read()
        re_api = re.compile(r"API No\.:\n(.*)")
        api = re_api.search(s).group(1).split('"')[0].strip()
        print(api)
    except Exception as e:
        print(e)
I am expecting to find the API number in the PDF files.
PDF files are stored as bytes. Therefore, to read or write a PDF file you need to open it in binary mode, with rb or wb.
with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode())
'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
occurs because the PDF is not UTF-8 encoded (PDFs generally are not), or possibly because of your editor. Therefore use:
with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode('latin-1'))  # or any encoding which is suitable here
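As a sketch of why this works: latin-1 maps every byte to a code point, so the decode never fails, and any ASCII text embedded in the bytes survives and can still be matched. The bytes below are fabricated for illustration (real PDFs often compress their text streams, in which case no plain-text search will find anything):

```python
import re

# Fabricated PDF-like bytes: contain the invalid-UTF-8 byte 0xe2
# plus the target text stored as plain ASCII
pdf_bytes = b"\xe2 junk ... API No.: 42-123-45678 ... more junk"

text = pdf_bytes.decode('latin-1')  # never raises: every byte is valid latin-1
match = re.search(r"API No\.: ?([\d-]+)", text)
print(match.group(1))  # -> 42-123-45678
```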
If your editor's console is incompatible with the encoding, you also won't be able to see any output.
A NOTE: you can't use the encoding parameter while opening in rb mode, so you have to decode after reading the file.
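That note can be checked directly: in binary mode, open() rejects the encoding argument outright, so decoding has to happen after the read. A minimal sketch, using a throwaway temp file created just for the demo:

```python
import tempfile

# Create a throwaway file for the demonstration
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"demo")
    path = tmp.name

# Binary mode refuses an encoding argument...
try:
    open(path, 'rb', encoding='utf-8')
except ValueError as e:
    print(e)  # -> binary mode doesn't take an encoding argument

# ...so read bytes first, then decode explicitly
with open(path, 'rb') as f:
    text = f.read().decode('utf-8')
print(text)  # -> demo
```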
When you open a file with open(..., 'r', encoding='utf-8'), you are asserting that this is a text file containing only valid UTF-8 bytes. That guarantee cannot hold for a PDF file: it is a binary format which may or may not contain strings in UTF-8, so this is not the way to read it.
If you have access to a library which reads PDF and extracts text strings, you could do
# Dunno if such a library exists, but bear with ...
instance = myFantasyPDFlibrary('file.pdf')
for text_snippet in instance.enumerate_texts_in_PDF():
    if 'API No.:\n' in text_snippet:
        api = text_snippet.split('API No.:\n')[1].split('\n')[0].split('"')[0].strip()
More realistically, but in a more pedestrian fashion, you could read the PDF file as a binary file, and look for the encoded text.
with open('file.pdf', 'rb') as pdf:
    pdfbytes = pdf.read()

if b'API No.:\n' in pdfbytes:
    api_text = pdfbytes.split(b'API No.:\n')[1].split(b'\n')[0].decode('utf-8')
    api = api_text.split('"')[0].strip()
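Since re also accepts bytes patterns, the same byte-level lookup can be written as a single regex over the raw bytes. The pdfbytes value below is a fabricated stand-in for the contents of a real PDF:

```python
import re

# Fabricated stand-in for the raw bytes of a PDF file
pdfbytes = b'...stream...API No.:\n42-123-45678"\n...endstream...'

m = re.search(rb'API No\.:\n(.*)', pdfbytes)
if m:
    api = m.group(1).decode('utf-8').split('"')[0].strip()
    print(api)  # -> 42-123-45678
```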
A crude workaround is to lie to Python about the encoding and claim that it's actually Latin-1. This particular encoding has the attractive feature that every byte maps exactly to its own Unicode code point, so you can read binary data as text and get away with it. But then, of course, any actual UTF-8 will be converted to mojibake (so "hëlló" will render as "hÃ«llÃ³", for example). You can recover the actual UTF-8 text by converting the text back to bytes and then decoding it with the correct encoding (latintext.encode('latin-1').decode('utf-8')).
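The round trip above can be sketched in a few lines, using a made-up sample string:

```python
s = "hëlló"  # made-up sample containing non-ASCII characters

# Reading UTF-8 bytes as latin-1 produces mojibake...
garbled = s.encode('utf-8').decode('latin-1')
print(garbled)  # -> hÃ«llÃ³

# ...which the latin-1 -> utf-8 round trip undoes
restored = garbled.encode('latin-1').decode('utf-8')
print(restored)  # -> hëlló
```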
Just switch to a different codec: encoding = 'unicode_escape'