I am using the Python lxml library to transform XML files to a new schema, but I've run into problems parsing processing instructions in the XML body.
The processing instruction elements are scattered throughout the XML, as in the following example (they all begin with "oasys" and end with a unique code):
string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
I can't locate them through the lxml.etree.findall() method, although etree.getchildren() returns them:
tree = lxml.etree.fromstring(string)
print tree.findall(".//")
>>>> [<Element i at 0x747c>]
print tree.getchildren()
>>>> [<?oasys _dc21-?>, <Element i at 0x747x>]
print tree.getchildren()[0].tag
>>>> <built-in function ProcessingInstruction>
print tree.getchildren()[0].tail
>>>> Text
Is there an alternative to using getchildren() to parse and remove processing instructions, especially considering that they're nested at various levels throughout the XML?
You can use the processing-instruction() XPath node test to find the processing instructions and remove them using etree.strip_tags().
Example:
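from lxml import etree
string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
tree = etree.fromstring(string)
# Select every processing instruction at any depth.
pis = tree.xpath("//processing-instruction()")
for pi in pis:
    parent = pi.getparent()
    # An earlier strip_tags() call may already have detached this node.
    if parent is not None:
        etree.strip_tags(parent, pi.tag)
# encoding="unicode" returns text rather than bytes, so the printed output has no b'' prefix.
print(etree.tostring(tree, encoding="unicode"))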
Output:
<text>Text <i>contents</i></text>
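Since the instructions are nested at various levels, a shorter variant may also be worth trying: both tree.iter() and etree.strip_tags() accept the etree.ProcessingInstruction factory as a node-type filter, so a single call on the root should cover the whole subtree. The sketch below is a minimal illustration and not part of the original answer. The second instruction (_dc22-) inside <i> is a made-up addition to the sample so that the nested case is exercised.

```python
from lxml import etree

# Sample from the question, plus a hypothetical nested instruction inside <i>.
xml = "<text><?oasys _dc21-?>Text <i>contents<?oasys _dc22-?></i></text>"
tree = etree.fromstring(xml)

# iter() takes the ProcessingInstruction factory as a filter, so nested
# instructions are found without walking getchildren() by hand.
found = [(pi.target, pi.text) for pi in tree.iter(etree.ProcessingInstruction)]

# One call removes every processing instruction below the root; the text that
# followed each instruction (its tail) stays in the document.
etree.strip_tags(tree, etree.ProcessingInstruction)
result = etree.tostring(tree, encoding="unicode")

print(found)
print(result)
```

Each instruction exposes its name as .target and its content as .text, which is handy if the unique codes need to be recorded before the instructions are stripped.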