I need to escape special characters in an invalid XML file which is about 5000 lines long. Here's an example of the XML that I have to deal with:
<root>
<element>
<name>name & surname</name>
<mail>[email protected]</mail>
</element>
</root>
Here the problem is the character "&" in the name. How would you escape special characters like this with a Python library? I didn't find a way to do it with BeautifulSoup.
If you don't mind the invalid characters being dropped from the XML, you could use the lxml XML parser's recover option (see Parsing broken XML with lxml.etree.iterparse):
from lxml import etree

# broken_xml is assumed to hold the invalid XML text from the question
parser = etree.XMLParser(recover=True)  # recover from bad characters
root = etree.fromstring(broken_xml, parser=parser)
print(etree.tostring(root).decode())
<root>
<element>
<name>name surname</name>
<mail>[email protected]</mail>
</element>
</root>
You're probably just wanting to do some simple regexp-ery on the HTML before throwing it into BeautifulSoup.
Even simpler, if there aren't any SGML entities (&...;) in the code, html = html.replace('&', '&amp;') will do the trick.
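A minimal sketch of that plain replace, assuming the input contains no existing entities at all. The sample string is a shortened version of the XML from the question, and the standard library's xml.etree.ElementTree is used here only to check that the result parses:

```python
import xml.etree.ElementTree as ET

# Shortened sample from the question; assumed to contain no "&...;" entities.
broken_xml = "<root><element><name>name & surname</name></element></root>"

# Escape every ampersand. This would double-escape any existing entity.
fixed_xml = broken_xml.replace("&", "&amp;")

# The escaped text now parses, and the parser decodes "&amp;" back to "&".
root = ET.fromstring(fixed_xml)
name = root.find("element/name").text
print(name)  # name & surname
```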
Otherwise, try this:
import re

x = "<html><h1>Fish & Chips & Gravy</h1><p>Fish & Chips & Gravy</p>"
# Escape each "&" whose next character cannot start an entity
q = re.sub(r'&([^a-zA-Z#])', r'&amp;\1', x)
print(q)
Essentially the regex looks for & followed by a character that is neither a letter nor #. It won't deal with an ampersand at the very end of the input, or one immediately followed by a letter (as in "Fish&Chips"), but that's probably fixable.
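One possible way to cover those gaps is a negative lookahead that escapes any & that does not begin a complete entity. This is only a sketch: the name escape_bare_ampersands and the sample input are made up for illustration, and the pattern treats anything shaped like &name; as an entity, so named entities that XML does not predefine (for example &nbsp;) would be left alone and still fail to parse.

```python
import re
import xml.etree.ElementTree as ET

# "&" NOT followed by a named, decimal, or hexadecimal entity ending in ";".
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Replace every bare '&' with '&amp;', leaving existing entities alone."""
    return BARE_AMP.sub("&amp;", text)

# Hypothetical input mixing bare ampersands with already-escaped ones.
broken_xml = "<root><name>name & surname</name><note>a &amp; b &#38; c &</note></root>"
fixed_xml = escape_bare_ampersands(broken_xml)

# The standard-library parser accepts the result.
root = ET.fromstring(fixed_xml)
print(root.find("name").text)  # name & surname
print(root.find("note").text)  # a & b & c &
```

Because the lookahead consumes nothing, an ampersand at the very end of the input is escaped as well.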