
Handling an encoding error when parsing XML with Beautiful Soup

My XML file declares its encoding thus:

<?xml version="1.0" encoding="utf-8"?>

I am trying to parse this file using Beautiful Soup.

from bs4 import BeautifulSoup

fd = open("xmlsample.xml")  
soup = BeautifulSoup(fd,'lxml-xml',from_encoding='utf-8')

But this results in

Traceback (most recent call last):
  File "C:\Users\gregg_000\Desktop\Python Experiments\NRE_XMLtoCSV\NRE_XMLtoCSV\bs1.py", line 4, in <module>
    soup = BeautifulSoup(fd,'lxml-xml', from_encoding='utf-8')
  File "C:\Users\gregg_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 245, in __init__
    markup = markup.read()
  File "C:\Users\gregg_000\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5343910: character maps to <undefined>

My sense is that Python is falling back to the default cp1252 character set when it reads the file. How can I force utf-8 without resorting to the command line? (I'm in a setup where I can't easily make global changes to the Python installation.)

Greg Williams asked Feb 22 '19 15:02




1 Answer

You should also pass the encoding to your open() call (encoding is an accepted keyword argument, as the docs indicate). On Windows (at least on my install), the default is, as you guessed, cp1252.

from bs4 import BeautifulSoup

# Decode the file as UTF-8 explicitly instead of relying on the platform default
fd = open("xmlsample.xml", encoding='utf-8')
soup = BeautifulSoup(fd, 'lxml-xml', from_encoding='utf-8')
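
If you want to confirm the diagnosis before touching the parser, here is a minimal sketch that uses only the standard library. It assumes a small made-up sample document (not the asker's file) containing a right double quotation mark, U+201D, whose UTF-8 encoding ends in the byte 0x9d. That byte has no mapping in cp1252, which is consistent with the error in the traceback. The variable names are just for illustration.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (unless UTF-8 mode is on).
# On many Windows installs this reports cp1252.
default_encoding = locale.getpreferredencoding(False)
print("default encoding:", default_encoding)

# Hypothetical sample document; U+201D encodes to the bytes E2 80 9D in UTF-8.
xml_text = '<?xml version="1.0" encoding="utf-8"?>\n<note>\u201cquoted\u201d</note>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xmlsample.xml")
    with open(path, "wb") as f:
        f.write(xml_text.encode("utf-8"))

    # Decoding as cp1252 reproduces the failure from the question.
    cp1252_error = None
    try:
        with open(path, encoding="cp1252") as fd:
            fd.read()
    except UnicodeDecodeError as exc:
        cp1252_error = exc

    # Decoding as UTF-8 succeeds and round-trips the text.
    with open(path, encoding="utf-8") as fd:
        decoded = fd.read()

print("cp1252 failure:", cp1252_error)
print("utf-8 ok:", decoded == xml_text)
```

One caveat, as far as I can tell: a file opened in text mode hands Beautiful Soup text that is already decoded, so from_encoding probably has no effect in that case. It is the encoding argument to open() that does the work.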
Jonah Bishop answered Oct 06 '22 01:10
