Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

unicode error with codecs when reading a pdf file in python

I am trying to read a pdf file with the following contain:

%PDF-1.4\n%âãÏÓ

If I read it with open, it works but if I try with codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I got the following exception:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Do you know a way to get a unicode string from the content of this file?

PS: I am using python 2.7

like image 566
trez Avatar asked Apr 09 '26 18:04

trez


1 Answers

PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.

For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:

f = codecs.open(filename, encoding="ISO8859-1", mode="rb")

But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.


Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!