My XML file declares its encoding as follows:
<?xml version="1.0" encoding="utf-8"?>
I am trying to parse this file using Beautiful Soup:
from bs4 import BeautifulSoup
fd = open("xmlsample.xml")
soup = BeautifulSoup(fd,'lxml-xml',from_encoding='utf-8')
But this results in
Traceback (most recent call last):
  File "C:\Users\gregg_000\Desktop\Python Experiments\NRE_XMLtoCSV\NRE_XMLtoCSV\bs1.py", line 4, in <module>
    soup = BeautifulSoup(fd,'lxml-xml', from_encoding='utf-8')
  File "C:\Users\gregg_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 245, in __init__
    markup = markup.read()
  File "C:\Users\gregg_000\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5343910: character maps to <undefined>
My sense is that Python wants to use the default cp1252 character set. How can I force utf-8 without resorting to the command line? (I'm in an environment where I can't easily make global changes to the Python setup.)
You should also pass the encoding to your open() call (it's an accepted argument, as the docs indicate). On Windows (at least on my install), the default encoding is, as you guessed, cp1252.
from bs4 import BeautifulSoup
fd = open("xmlsample.xml", encoding='utf-8')
soup = BeautifulSoup(fd,'lxml-xml',from_encoding='utf-8')
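To see why the open() call is what matters, here is a minimal sketch that reproduces the failure without Beautiful Soup at all. The file name sample.xml and its contents are made up for illustration; the only assumption is that your file contains a character such as a curly closing quote (U+201D), whose UTF-8 encoding ends in byte 0x9d, which cp1252 cannot decode.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given
# (typically cp1252 on a Western-locale Windows install).
default_encoding = locale.getpreferredencoding(False)

# Hypothetical sample document: U+201D is E2 80 9D in UTF-8,
# and byte 0x9D has no mapping in cp1252.
xml_text = '<?xml version="1.0" encoding="utf-8"?>\n<root>\u201dquoted\u201d</root>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as f:
        f.write(xml_text.encode("utf-8"))

    # Simulate the cp1252 default explicitly so this behaves the same everywhere.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Passing the encoding to open() decodes the file correctly.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

print("default encoding:", default_encoding)
print("cp1252 read failed:", cp1252_failed)
```

Note that the error is raised inside markup.read(), before Beautiful Soup ever looks at from_encoding, which is why that argument alone did not help. As far as I know, from_encoding only applies to byte input, so another option should be to open the file in binary mode (open("xmlsample.xml", "rb")) and let Beautiful Soup do the decoding.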