
BeautifulSoup Decode error

I am trying to parse HTML files generated by Evernote using Beautiful Soup. The code is:

html = open('D:/page.html', 'r')
soup = BeautifulSoup(html)

It gives the following error:

  File "C:\Python33\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 24274: character maps to <undefined>

How can I resolve this issue?

asked Jun 23 '14 17:06 by bhavesh


1 Answer

Pass an encoded byte string, or a file object opened in binary mode, to BeautifulSoup instead; it will handle the decoding:

from bs4 import BeautifulSoup

with open('D:/page.html', 'rb') as html:
    soup = BeautifulSoup(html)
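
For context on why text mode fails: the traceback shows the file being decoded with cp1252, and Python's cp1252 codec has no mapping for byte 0x9d. That byte does occur in UTF-8 text, for example as the last byte of a right curly quote, so the file is plausibly UTF-8 (an assumption; the question does not say). A minimal sketch of the same failure, using only the standard library:

```python
# UTF-8 bytes for a right double quotation mark (U+201D): b'\xe2\x80\x9d'.
data = "\u201d".encode("utf-8")

# Decoding with cp1252 fails because 0x9d is undefined in that code page.
try:
    data.decode("cp1252")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

# Decoding with the codec the bytes were actually written in succeeds.
decoded = data.decode("utf-8")
print(repr(decoded))
```

Opening the file in binary mode skips this implicit decode step entirely, which is why the parser then gets a chance to pick the codec itself.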

BeautifulSoup looks for HTML metadata in the document itself (such as a <meta> tag with a charset attribute) to decode the document; failing that, the chardet library, if installed, is used to make an educated guess about the encoding. chardet applies heuristics and statistics about byte sequences to give BeautifulSoup the most likely codec.
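
To see which encoding was picked, you can inspect the soup's original_encoding attribute. The sketch below uses a made-up in-memory document rather than the Evernote file, and passes "html.parser" explicitly, which is a choice of this example and not something the question's code did:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: HTML bytes that declare their own encoding
# in a <meta> tag and contain curly quotes.
markup = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>\u201cquoted\u201d</p></body></html>'
).encode("utf-8")

# Bytes in, Unicode out: the parser decodes using the declared charset.
soup = BeautifulSoup(markup, "html.parser")

detected = soup.original_encoding  # the codec BeautifulSoup settled on
text = soup.p.get_text()

print(detected)
print(text)
```

If original_encoding is not what you expect for a real file, that is a sign the document's declaration is missing or wrong.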

If you have more context and already know the correct codec to use, pass that in with the from_encoding argument:

with open('D:/page.html', 'rb') as html:
    soup = BeautifulSoup(html, from_encoding=some_explicit_codec)

See the Encodings section of the documentation.
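
As a concrete illustration of from_encoding, here is a small sketch with an in-memory document. The sample bytes and the choice of cp1252 are assumptions made for the example, not details from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: Windows-1252 bytes with no charset declaration.
raw = "<p>caf\u00e9</p>".encode("cp1252")

# Tell BeautifulSoup the codec explicitly instead of letting it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="cp1252")

text = soup.p.get_text()
print(text)
```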

answered Oct 14 '22 05:10 by Martijn Pieters