unicode error with codecs when reading a pdf file in python

Question

I am trying to read a pdf file with the following contain:

%PDF-1.4
%âãÏÓ

If I read it with open, it works but if I try with codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I got the following exception:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Do you know a way to get a unicode string from the content of this file?

PS: I am using python 2.7

Admin · Accepted Answer

PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.

For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:

f = codecs.open(filename, encoding="ISO8859-1", mode="rb")

But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.

unicode error with codecs when reading a pdf file in python

Tags:

python

pdf

unicode

trez

1 Answers

Recent Activity

Donate For Us

unicode error with codecs when reading a pdf file in python

Tags:

python

pdf

unicode

trez

1 Answers

Related questions

Recent Activity

Donate For Us