I can read all XML files that start with <?xml version="1.0" encoding="utf-8"?>, but I cannot read files that start with <?xml version="1.0" encoding="ISO-8859-1"?>.
Specifically, I have two files:
xml-iso.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
</note>
xml-utf.xml:
<?xml version="1.0" encoding="utf-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
</note>
With the following code I can find the note element in the UTF-8 file, but not in the file with the other encoding. How can I solve that?
Example code:
import unittest

from bs4 import BeautifulSoup as Soup


class TestEncoding(unittest.TestCase):

    def test_iso(self):
        with open('tests/xml-iso.xml', 'r') as f_in:
            xml_soup = Soup(f_in.read(), 'xml')
        print('xml-iso:\n{}'.format(xml_soup))
        note = xml_soup.find('note')
        self.assertIsNotNone(note)

    def test_utf8(self):
        with open('tests/xml-utf.xml', 'r') as f_in:
            xml_soup = Soup(f_in.read(), 'xml')
        print('xml-utf8:\n{}'.format(xml_soup))
        note = xml_soup.find('note')
        self.assertIsNotNone(note)


if __name__ == '__main__':
    unittest.main()
Versions:
Python 3.5.2
beautifulsoup4==4.6.0
Coincidentally, I stumbled upon another workaround: read the file in binary mode ('rb'):
with open('tests/xml-iso.xml', 'rb') as f_in:
    xml_soup = Soup(f_in.read(), 'xml')
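A likely reason binary mode helps: when the parser receives bytes, it can read the encoding declaration and decode the data itself, whereas text mode hands it a str that has already been decoded. The sketch below illustrates this with the standard library's xml.etree.ElementTree instead of Beautiful Soup, so it runs without third-party packages. The sample document and the helper name parse_note_bytes are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: same structure as xml-iso.xml, with one non-ASCII
# character (e-acute) added so that the encoding actually matters.
XML_TEXT = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<note><to>Tov\u00e9</to><from>Jani</from></note>'
)
# This is what open(path, 'rb').read() would return for such a file.
XML_BYTES = XML_TEXT.encode('iso-8859-1')


def parse_note_bytes(data):
    """Parse raw bytes; the parser decodes them using the XML declaration."""
    return ET.fromstring(data)


root = parse_note_bytes(XML_BYTES)
print(root.tag)
# ascii() keeps the output printable on any console encoding.
print(ascii(root.find('to').text))
```

The same idea should carry over to Soup(f_in.read(), 'xml') with a file opened in 'rb' mode, as in the snippet above.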
I have the exact same problem. My workaround is to not read the xml declaration:
with open('tests/xml-iso.xml', 'r', encoding='iso-8859-1') as f_in:
    f_in.readline()  # skipping header and letting soup create its own header
    xml_soup = Soup(f_in.read(), 'xml', from_encoding='ISO-8859-1')
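Skipping the first line with readline() assumes the declaration sits alone on line one. A slightly more defensive sketch removes only the declaration itself with a regular expression. The helper name strip_xml_declaration is hypothetical and not part of Beautiful Soup.

```python
import re

# Matches an optional BOM and leading whitespace, then "<?xml ... ?>"
# and any whitespace that follows it.
_XML_DECL = re.compile(r'^\ufeff?\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return text without a leading XML declaration, if one is present."""
    return _XML_DECL.sub('', text, count=1)


sample = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<note><to>Tove</to></note>'
)
print(strip_xml_declaration(sample))
```

The returned string can then be passed to Soup(..., 'xml') in place of the f_in.readline() / f_in.read() pair.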