Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

unicode error with codecs when reading a pdf file in python

I am trying to read a pdf file with the following contain:

%PDF-1.4\n%âãÏÓ

If I read it with open, it works but if I try with codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I got the following exception:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Do you know a way to get a unicode string from the content of this file?

PS: I am using python 2.7

like image 566
trez Avatar asked Apr 09 '26 18:04

trez


1 Answers

PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.

For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:

f = codecs.open(filename, encoding="ISO8859-1", mode="rb")

But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.