Removing Processing Instructions with Python lxml

Tags: python, xml, lxml

I am using the Python lxml library to transform XML files to a new schema, but I've run into problems handling processing instructions in the XML body.

The processing instruction elements are scattered throughout the XML, as in the following example (they all begin with "oasys" and end with a unique code):

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"

I can't locate them with the element's findall() method, although getchildren() returns them:

import lxml.etree

tree = lxml.etree.fromstring(string)
print(tree.findall(".//"))
>>>> [<Element i at 0x747c>]
print(tree.getchildren())
>>>> [<?oasys _dc21-?>, <Element i at 0x747x>]
print(tree.getchildren()[0].tag)
>>>> <built-in function ProcessingInstruction>
print(tree.getchildren()[0].tail)
>>>> Text 

Is there an alternative to using getchildren() to parse and remove processing instructions, especially considering that they're nested at various levels throughout the XML?

meng_die asked Jul 20 '15 16:07


1 Answer

You can use the processing-instruction() XPath node test to find the processing instructions at any depth, then remove them with etree.strip_tags(), which drops the nodes but keeps the text that follows them (their tail).
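To see what the node test matches before removing anything, here is a minimal sketch. The sample string is hypothetical: the second processing instruction and its target "other" are invented for illustration. Passing a target name to processing-instruction() is standard XPath 1.0 and narrows the match to that target.

```python
from lxml import etree

# Hypothetical sample: the second PI and its target "other" are made up.
xml = "<text><?oasys _dc21-?>Text <i><?other x?>contents</i></text>"
tree = etree.fromstring(xml)

# Every processing instruction, at any depth, in document order.
all_pis = tree.xpath("//processing-instruction()")

# Only the processing instructions whose target is "oasys".
oasys_pis = tree.xpath("//processing-instruction('oasys')")

for pi in all_pis:
    # .target is the PI name, .text its content, .tail the text that follows it.
    print(pi.target, repr(pi.text), repr(pi.tail))
```

Since all of the instructions in the question share the "oasys" target, the narrower query may be enough if other processing instructions need to survive.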

Example:

from lxml import etree

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
tree = etree.fromstring(string)

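# Find every processing instruction, at any nesting level.
pis = tree.xpath("//processing-instruction()")
for pi in pis:
    # pi.tag is etree.ProcessingInstruction; strip_tags() keeps the tail text.
    etree.strip_tags(pi.getparent(), pi.tag)

# encoding="unicode" returns a str, so this also prints cleanly on Python 3.
print(etree.tostring(tree, encoding="unicode"))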

Output:

<text>Text <i>contents</i></text>
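Because strip_tags() works on all descendants of the element it is given, a single call on the root should be enough when the instructions are nested at several levels. The sketch below assumes a hypothetical sample with a second, nested instruction ("_dc22-" is invented), and contrasts the result with remove(), which discards the tail text along with the node.

```python
from lxml import etree

# Hypothetical sample with processing instructions at two nesting levels.
xml = "<text><?oasys _dc21-?>Text <i><?oasys _dc22-?>contents</i> end</text>"

# One call strips matching nodes from every descendant of the root,
# leaving the text that followed each instruction in place.
tree = etree.fromstring(xml)
etree.strip_tags(tree, etree.ProcessingInstruction)
result = etree.tostring(tree, encoding="unicode")
print(result)

# For contrast: remove() takes the tail text away with the node,
# so "Text " and "contents" are lost here.
tree2 = etree.fromstring(xml)
for pi in tree2.xpath("//processing-instruction()"):
    pi.getparent().remove(pi)
lossy = etree.tostring(tree2, encoding="unicode")
print(lossy)
```

The first variant is usually what a schema transformation wants, since the surrounding text is part of the document content.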
mzjn answered Sep 18 '22 10:09