Removing Processing Instructions with Python lxml

Question

I am using the Python lxml library to transform XML files to a new schema, but I've run into problems handling processing instructions in the XML body.

The processing instruction elements are scattered throughout the XML, as in the following example (they all begin with "oasys" and end with a unique code):

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"

I can't locate them with the element's findall() method, although getchildren() returns them:

import lxml.etree

tree = lxml.etree.fromstring(string)
print(tree.findall(".//*"))
# [<Element i at 0x747c>]
print(tree.getchildren())
# [<?oasys _dc21-?>, <Element i at 0x747x>]
print(tree.getchildren()[0].tag)
# <built-in function ProcessingInstruction>
print(tree.getchildren()[0].tail)
# Text

Is there an alternative to using getchildren() to parse and remove processing instructions, especially considering that they're nested at various levels throughout the XML?

mzjn · Accepted Answer

You can use the processing-instruction() XPath node test to find the processing instructions and remove them using etree.strip_tags().

Example:

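from lxml import etree

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
tree = etree.fromstring(string)

# Find every processing instruction in the document.
pis = tree.xpath("//processing-instruction()")
for pi in pis:
    parent = pi.getparent()
    # strip_tags() works on the parent's whole subtree, so a PI found
    # earlier in the loop may already be detached; skip those.
    if parent is not None:
        etree.strip_tags(parent, pi.tag)

# encoding="unicode" returns str rather than bytes on Python 3.
print(etree.tostring(tree, encoding="unicode"))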

Output:

<text>Text <i>contents</i></text>
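
Because strip_tags() processes a whole subtree, the loop above can likely be collapsed into a single call when every processing instruction should go, however deeply it is nested. The sketch below shows that, plus a variant that removes only instructions with a given target and leaves the rest in place. The helper names strip_all_pis and strip_pis_by_target are hypothetical, and the second oasys instruction and the <?other keep?> instruction in the sample were added here for illustration; treat this as one possible approach rather than the definitive one.

```python
from lxml import etree


def strip_all_pis(root):
    """Remove every processing instruction below root, keeping tail text."""
    # etree.ProcessingInstruction is the factory that pi.tag refers to.
    etree.strip_tags(root, etree.ProcessingInstruction)


def strip_pis_by_target(root, target):
    """Remove only the processing instructions whose target equals target."""
    for pi in root.xpath(".//processing-instruction()"):
        parent = pi.getparent()
        if pi.target != target or parent is None:
            continue
        # remove() would drop the tail text along with the node,
        # so move the tail to the preceding sibling or the parent first.
        if pi.tail:
            previous = pi.getprevious()
            if previous is not None:
                previous.tail = (previous.tail or "") + pi.tail
            else:
                parent.text = (parent.text or "") + pi.tail
        parent.remove(pi)


# Sample input (extended from the question for illustration).
xml = (
    "<text><?oasys _dc21-?>Text "
    "<i>con<?oasys _dc22-?>tents</i><?other keep?></text>"
)

everything = etree.fromstring(xml)
strip_all_pis(everything)
all_stripped = etree.tostring(everything, encoding="unicode")

selective = etree.fromstring(xml)
strip_pis_by_target(selective, "oasys")
oasys_stripped = etree.tostring(selective, encoding="unicode")

print(all_stripped)    # <text>Text <i>contents</i></text>
print(oasys_stripped)  # <text>Text <i>contents</i><?other keep?></text>
```

The single-call version is the simpler choice when all instructions are unwanted. The target-based variant is only worth the extra code if some instructions must survive the transformation.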

Tags:

python

xml

lxml

meng_die

1 Answer

mzjn
