I am trying to parse HTML files generated by Evernote using Beautiful Soup. The code is:
from bs4 import BeautifulSoup
html = open('D:/page.html', 'r')
soup = BeautifulSoup(html)
It gives the following error:
File "C:\Python33\lib\site-packages\bs4\__init__.py", line 161, in __init__
markup = markup.read()
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 24274: character maps to <undefined>
How can I resolve this issue?
Pass an encoded byte string, or even a file object (opened in binary mode) to BeautifulSoup instead; it'll handle the decoding:
with open('D:/page.html', 'rb') as html:
    soup = BeautifulSoup(html)
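For context on why the text-mode version fails: the traceback shows the file being decoded with cp1252, which is what open() falls back to on many Windows systems when no encoding is given, and byte 0x9d has no character assigned in that code page. The sketch below reproduces the same kind of error with the standard library only. It assumes, purely for illustration, that the file is UTF-8 and contains a right double quotation mark (the original post does not say what the file contains); try_decode is a hypothetical helper name.

```python
# A right double quotation mark (U+201D) encoded as UTF-8 ends in byte 0x9d.
# Assumption for illustration only: the real file may contain other characters.
data = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'


def try_decode(raw, codec):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError as exc:
        return "error: " + str(exc)


# cp1252 has no mapping for 0x9d, so decoding fails like in the traceback.
cp1252_result = try_decode(data, "cp1252")
# The same bytes decode cleanly once the right codec is used.
utf8_result = try_decode(data, "utf-8")

print(cp1252_result)
print(utf8_result)
```

Opening the file in binary mode sidesteps this step entirely, because no decoding happens until BeautifulSoup has inspected the bytes.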
BeautifulSoup looks for HTML metadata in the document itself (such as a <meta> tag with a charset attribute) to decode the document; failing that, the chardet library is used to make an (educated) guess about what encoding is used. chardet uses heuristics and statistics about byte sequences to provide BeautifulSoup with the most likely codec.
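To make the idea of reading the declared charset from the raw bytes more concrete, here is a rough standard-library sketch. It is not Beautiful Soup's actual implementation; the names META_CHARSET, sniff_declared_encoding and decode_html, the 1024-byte search window, and the UTF-8 fallback are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: find charset=... inside a <meta> tag in the raw bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_declared_encoding(raw, search_bytes=1024):
    """Return the charset named in a <meta> tag near the start, or None."""
    match = META_CHARSET.search(raw[:search_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(raw, fallback="utf-8"):
    """Decode bytes using the declared charset, else the assumed fallback."""
    encoding = sniff_declared_encoding(raw) or fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:  # declared codec name is unknown to Python
        return raw.decode(fallback, errors="replace")


sample = b'<html><head><meta charset="utf-8"></head><body>\xe2\x80\x9d</body></html>'
declared = sniff_declared_encoding(sample)
text = decode_html(sample)
print(declared)
```

A real detector does considerably more than this (byte-order marks, XML declarations, statistical guessing), which is why handing the bytes to BeautifulSoup is usually the simpler choice.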
If you have more context and already know the correct codec to use, pass that in with the from_encoding argument:
with open('D:/page.html', 'rb') as html:
    soup = BeautifulSoup(html, from_encoding=some_explicit_codec)
See the Encodings section of the documentation.