I am trying to read all PDF files in a folder and search them for a number with a regular expression. On inspection, the charset of the PDFs is 'UTF-8'. My code throws this error:
'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
I tried reading in binary mode, and I tried Latin-1 encoding, but then the text is full of special characters and nothing shows up in the search.
import os
import re

download_file_path = "C:\\Users\\...\\..\\"
for file_name in os.listdir(download_file_path):
    try:
        with open(os.path.join(download_file_path, file_name), 'r', encoding="UTF-8") as f:
            s = f.read()
        re_api = re.compile(r"API No\.:\n(.*)")
        api = re_api.search(s).group(1).split('"')[0].strip()
        print(api)
    except Exception as e:
        print(e)
I am expecting to find the API number in the PDF files.
PDF files are stored as bytes. Therefore, to read or write a PDF file you need to open it in binary mode, with rb or wb.
with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode())
'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
occurs because the PDF is not UTF-8 encoded (PDFs generally are not), or possibly because of your editor. Therefore use:
with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode('latin-1'))  # or any encoding which is suitable here
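As a sketch of why this works: latin-1 maps every byte to a code point, so the decode never fails, and any ASCII text embedded in the bytes survives and can still be matched. The bytes below are fabricated for illustration (real PDFs often compress their text streams, in which case no plain-text search will find anything):

```python
import re

# Fabricated PDF-like bytes: contain the invalid-UTF-8 byte 0xe2
# plus the target text stored as plain ASCII
pdf_bytes = b"\xe2 junk ... API No.: 42-123-45678 ... more junk"

text = pdf_bytes.decode('latin-1')  # never raises: every byte is valid latin-1
match = re.search(r"API No\.: ?([\d-]+)", text)
print(match.group(1))  # -> 42-123-45678
```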
If your editor's console is incompatible with the encoding, you also won't be able to see any output.
A NOTE: you can't use the encoding parameter while opening in rb mode, so you have to decode after reading the file.
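That note can be checked directly: in binary mode, open() rejects the encoding argument outright, so decoding has to happen after the read. A minimal sketch, using a throwaway temp file created just for the demo:

```python
import tempfile

# Create a throwaway file for the demonstration
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"demo")
    path = tmp.name

# Binary mode refuses an encoding argument...
try:
    open(path, 'rb', encoding='utf-8')
except ValueError as e:
    print(e)  # -> binary mode doesn't take an encoding argument

# ...so read bytes first, then decode explicitly
with open(path, 'rb') as f:
    text = f.read().decode('utf-8')
print(text)  # -> demo
```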
When you open a file with open(..., 'r', encoding='utf-8'), you are asserting that this is a text file containing only valid UTF-8 bytes. That guarantee cannot hold for a PDF file: it is a binary format which may or may not contain strings in UTF-8, so this is not the way to read it.
If you have access to a library which reads PDF and extracts text strings, you could do
# Dunno if such a library exists, but bear with ...
instance = myFantasyPDFlibrary('file.pdf')
for text_snippet in instance.enumerate_texts_in_PDF():
    if 'API No.:\n' in text_snippet:
        api = text_snippet.split('API No.:\n')[1].split('\n')[0].split('"')[0].strip()
More realistically, but in a more pedestrian fashion, you could read the PDF file as a binary file, and look for the encoded text.
with open('file.pdf', 'rb') as pdf:
    pdfbytes = pdf.read()

if b'API No.:\n' in pdfbytes:
    api_text = pdfbytes.split(b'API No.:\n')[1].split(b'\n')[0].decode('utf-8')
    api = api_text.split('"')[0].strip()
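Since re also accepts bytes patterns, the same byte-level lookup can be written as a single regex over the raw bytes. The pdfbytes value below is a fabricated stand-in for the contents of a real PDF:

```python
import re

# Fabricated stand-in for the raw bytes of a PDF file
pdfbytes = b'...stream...API No.:\n42-123-45678"\n...endstream...'

m = re.search(rb'API No\.:\n(.*)', pdfbytes)
if m:
    api = m.group(1).decode('utf-8').split('"')[0].strip()
    print(api)  # -> 42-123-45678
```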
A crude workaround is to lie to Python about the encoding and claim that it's actually Latin-1. This particular encoding has the attractive feature that every byte maps exactly to its own Unicode code point, so you can read binary data as text and get away with it. But then, of course, any actual UTF-8 will be converted to mojibake (so "hëlló" will render as "hÃ«llÃ³", for example). You can recover the actual UTF-8 text by converting the text back to bytes and then decoding it with the correct encoding (latintext.encode('latin-1').decode('utf-8')).
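The round trip above can be sketched in a few lines, using a made-up sample string:

```python
s = "hëlló"  # made-up sample containing non-ASCII characters

# Reading UTF-8 bytes as latin-1 produces mojibake...
garbled = s.encode('utf-8').decode('latin-1')
print(garbled)  # -> hÃ«llÃ³

# ...which the latin-1 -> utf-8 round trip undoes
restored = garbled.encode('latin-1').decode('utf-8')
print(restored)  # -> hëlló
```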
Just switch to a different codec: encoding = 'unicode_escape'