BeautifulSoup Decode error

Question

I am trying to parse HTML files generated by Evernote using Beautiful Soup. The code is:

html = open('D:/page.html', 'r')
soup = BeautifulSoup(html)

It gives the following error:

  File "C:\Python33\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 24274: character maps to <undefined>

How can I resolve this issue?

Martijn Pieters · Accepted Answer

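Opening the file in text mode makes Python decode it with the platform default codec (cp1252 here, as the traceback shows), and that codec has no mapping for byte 0x9d. Pass an encoded byte string, or even a file object opened in binary mode, to BeautifulSoup instead; it will handle the decoding: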

from bs4 import BeautifulSoup

with open('D:/page.html', 'rb') as html:
    soup = BeautifulSoup(html)
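
To see why the text-mode read fails, here is a minimal standard-library sketch. It assumes (this is a guess, not something the question confirms) that the Evernote export is UTF-8 and contains a curly closing quote, whose UTF-8 encoding ends in the byte 0x9d from the traceback:

```python
# Assumption: the file is UTF-8 and contains U+201D (right double
# quotation mark). Its UTF-8 encoding ends in byte 0x9d.
raw = '\u201d'.encode('utf-8')
print(raw)  # b'\xe2\x80\x9d'

error = None
try:
    # cp1252 is the codec the traceback shows Python picking by default.
    raw.decode('cp1252')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the codec the bytes were written in works fine.
text = raw.decode('utf-8')
```

Reading the file as bytes sidesteps the default codec entirely, which is why the binary-mode version above does not raise.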

BeautifulSoup looks for HTML metadata in the document itself (such as a <meta> tag with a charset attribute) to decode the document; failing that, the chardet library (if installed) is used to make an educated guess about the encoding. chardet uses heuristics and statistics about byte sequences to provide BeautifulSoup with the most likely codec.
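
As a small illustration of the detection described above, the sketch below feeds BeautifulSoup a byte string that declares its own charset. The sample markup and the explicit 'html.parser' argument are assumptions added for the example, not part of the original code; original_encoding is the attribute on the soup object that reports which codec was settled on.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: UTF-8 bytes with a <meta charset> declaration.
markup = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>\u201cquoted\u201d</p></body></html>'
)
raw = markup.encode('utf-8')

# Passing bytes lets BeautifulSoup do the decoding itself.
soup = BeautifulSoup(raw, 'html.parser')

print(soup.original_encoding)  # the codec BeautifulSoup decided on
print(soup.p.get_text())
```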

If you have more context and already know the correct codec to use, pass that in with the from_encoding argument:

with open('D:/page.html', 'rb') as html:
    soup = BeautifulSoup(html, from_encoding=some_explicit_codec)
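
A hedged sketch of the from_encoding case: the bytes below carry no charset declaration, so the codec is supplied by the caller. The sample text, the choice of windows-1251, and the 'html.parser' argument are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: Cyrillic text in windows-1251 with no <meta> tag,
# so nothing in the document says how to decode it.
raw = '<p>Привет</p>'.encode('windows-1251')

# We know the codec from context, so we state it explicitly.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='windows-1251')

print(soup.p.get_text())
```

Without from_encoding, BeautifulSoup would have to guess for input this short, and the guess may well be wrong.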

See the Encodings section of the documentation.

Tags: python, beautifulsoup

bhavesh
