Escaping bad XML while parsing

Question

I'm attempting to parse URLs from an XML sitemap that isn't mine. Unfortunately, some of the XML is poorly written and contains unescaped/invalid characters, such as ampersands.

This is the code block I'm using to parse through my XML file currently:

from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

urls = []
for child in root.iter():
    for block in child.findall('{}url'.format(name_space)):
        for url in block.findall('{}loc'.format(name_space)):
            urls.append('{}\n'.format(url.text))

with open('sample_urls.txt', 'w+') as f:
    f.writelines(urls)

I'm running into this error when it encounters an unescaped URL: ParseError: not well-formed (invalid token).

How can I escape these characters and still continue parsing the file? I've come across the escape() function in the xml.sax.saxutils module, but I'm not sure of the best way to apply it to the code I currently have.

Daniel Haley · Accepted Answer

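If you can, try parsing the file with lxml.html. Its HTML parser is lenient and recovers from errors such as unescaped ampersands instead of raising. Be careful, though: it ignores namespaces, so make sure your paths select only the elements you intend to select.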

Example...

sitemap_products_1.xml (Shortened version of the one you linked to. Notice the second url has a bad loc value.)

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
 xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
 <url>
  <loc>https://www.samsclub.com/sams/mirror-convex/prod13760282.ip</loc>
  <image:image>
   <image:title>See All 160 Degree Convex Security Mirror - 24&quot; w x 15&quot; h</image:title>
   <image:loc>https://scene7.samsclub.com/is/image/samsclub/0003308171524_A</image:loc>
  </image:image>
 </url>
 <url>
  <loc>https://www.samsclub.com/sams/at&t-3-handset-cordless-phone/prod21064454.ip</loc>
  <image:image>
   <image:title>AT&amp;T 3 Handset Cordless Phone</image:title>
   <image:loc>https://scene7.samsclub.com/is/image/samsclub/0065053003067_A</image:loc>
  </image:image>
 </url>
 <url>
  <loc>https://www.samsclub.com/sams/premium-free-flow-waterbed-mattress-kit-queen/104864.ip</loc>
  <image:image>
   <image:title>Premium Free Flow Waterbed Mattress Kit- Queen</image:title>
   <image:loc>https://scene7.samsclub.com/is/image/samsclub/0040649555859_A</image:loc>
  </image:image>
 </url>
</urlset>

Python 3.x

from lxml import html

tree = html.parse("sitemap_products_1.xml")

for elem in tree.findall(".//url/loc"):
    print(elem.text)

Output (Notice the second url is printed in its entirety.)

https://www.samsclub.com/sams/mirror-convex/prod13760282.ip
https://www.samsclub.com/sams/at&t-3-handset-cordless-phone/prod21064454.ip
https://www.samsclub.com/sams/premium-free-flow-waterbed-mattress-kit-queen/104864.ip
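
If lxml isn't an option, another approach is to escape the stray ampersands yourself before handing the text to ElementTree. Calling xml.sax.saxutils.escape() on the whole document won't work for this, because it would also escape the markup's own < and > and double-escape entities that are already valid. The following is only a rough sketch using the standard library, and it assumes bare ampersands are the only well-formedness problem in the file. The names BARE_AMP and extract_locs are made up for illustration.

```python
import re
from xml.etree import ElementTree as ET

# Matches an ampersand that does NOT begin one of XML's five predefined
# entities or a numeric character reference (assumption: the file declares
# no custom entities in a DTD).
BARE_AMP = re.compile(r'&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)')

NAMESPACE = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def extract_locs(xml_text):
    """Escape bare ampersands, parse, and return the text of each url/loc."""
    cleaned = BARE_AMP.sub('&amp;', xml_text)
    root = ET.fromstring(cleaned)
    # Unlike lxml.html, ElementTree keeps namespaces, so the path must use them.
    return [loc.text for loc in root.iterfind('{0}url/{0}loc'.format(NAMESPACE))]


# Shortened sample with one bad loc value (hypothetical inline data).
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url><loc>https://www.samsclub.com/sams/at&t-3-handset-cordless-phone/prod21064454.ip</loc></url>
 <url><loc>https://www.samsclub.com/sams/mirror-convex/prod13760282.ip</loc></url>
</urlset>"""

for loc in extract_locs(sample):
    print(loc)
```

Because the path is namespace-qualified, this only picks up the sitemap's own loc elements and not image:loc. For a real file you would read the text from disk first and pass it to extract_locs().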
