
In lxml, how do I remove a tag but retain all contents?


The problem is this: I have an XML fragment like so:

<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment> 

For the result, I want to remove all <a> and <c> tags but retain their text content and child nodes exactly as they are. The <b> element should be left untouched. The result should look like this:

<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>

For the time being, I'll resort to a very dirty trick: serialize the fragment with etree.tostring, remove the offending tags via string replacement or a regex, and replace the original fragment with the etree.fromstring result of that (not the real code, but it should go something like this):

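    from lxml import etree

    fragment = etree.fromstring("<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>")
    # Serialize to str rather than bytes so that str.replace works on Python 3.
    fstring = etree.tostring(fragment, encoding="unicode")
    # Remove the opening and closing <a> and <c> tags textually.
    fstring = fstring.replace("<a>", "")
    fstring = fstring.replace("</a>", "")
    fstring = fstring.replace("<c>", "")
    fstring = fstring.replace("</c>", "")
    # Re-parse the cleaned-up string.
    fragment = etree.fromstring(fstring)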

I know that I can probably use XSLT to achieve this, and I know that lxml supports XSLT, but surely there is a more lxml-native approach?
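For comparison, the XSLT route could look roughly like the following. This is a minimal sketch, assuming the tags to drop are exactly a and c and carry no namespace; the stylesheet and the variable names are illustrative and not taken from the original post.

```python
from lxml import etree

# Illustrative stylesheet: an identity transform plus one override.
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- For a and c, skip the element itself and emit only its children. -->
  <xsl:template match="a|c">
    <xsl:apply-templates select="node()"/>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)

fragment = etree.fromstring(
    "<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>"
)
# The transform returns a new tree; the input fragment is not modified.
result_tree = transform(fragment)
result = etree.tostring(result_tree.getroot(), encoding="unicode")
print(result)
```

Unlike an in-place edit, this produces a new tree, so the caller still has to swap it in for the original fragment.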

For reference: I've tried getting there with lxml's element.replace, but since I want to insert text where there was an element node before, I don't think I can do that.
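The reason element.replace does not fit is that lxml has no separate text nodes: text is stored in each element's .text and .tail attributes, so removing an element means merging its text into a neighbour. A rough sketch of doing that by hand follows; unwrap is a hypothetical helper name, not an lxml function.

```python
from lxml import etree

def unwrap(element):
    """Remove `element` from its parent, keeping its text, children and tail.

    Hypothetical helper for illustration; not part of lxml.
    """
    parent = element.getparent()
    if parent is None:
        raise ValueError("cannot unwrap the root element")
    index = parent.index(element)
    previous = element.getprevious()

    # Leading text joins the previous sibling's tail, or the parent's text
    # when there is no previous sibling.
    if element.text:
        if previous is not None:
            previous.tail = (previous.tail or "") + element.text
        else:
            parent.text = (parent.text or "") + element.text

    # Move the children into the parent, in order, just before the element.
    children = list(element)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)

    # The element's tail joins whatever now precedes it.
    if element.tail:
        if children:
            children[-1].tail = (children[-1].tail or "") + element.tail
        elif previous is not None:
            previous.tail = (previous.tail or "") + element.tail
        else:
            parent.text = (parent.text or "") + element.tail

    # Removing the element also drops its tail, which was copied above.
    parent.remove(element)

fragment = etree.fromstring(
    "<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>"
)
# Collect the matches first, because unwrap mutates the tree.
for el in list(fragment.iter("a", "c")):
    unwrap(el)
result = etree.tostring(fragment, encoding="unicode")
print(result)
```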

Thor asked Jan 13 '11 14:01



1 Answer

Try etree.strip_tags: http://lxml.de/api/lxml.etree-module.html#strip_tags

    >>> etree.strip_tags(fragment, 'a', 'c')
    >>> etree.tostring(fragment, encoding='unicode')
    '<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>'
Kabie answered Sep 22 '22 15:09
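As a self-contained check that strip_tags also keeps child nodes, here is a small sketch. The input is a variation of the question's fragment with a d element added inside a; that variation is an assumption made for illustration only.

```python
from lxml import etree

# Assumed input: the question's fragment with a <d> child added inside <a>.
fragment = etree.fromstring(
    "<fragment>text1 <a>inner<d>1</d> </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>"
)

# strip_tags works in place: the named elements are removed, while their
# text, children and tail text are merged into the parent.
etree.strip_tags(fragment, "a", "c")

result = etree.tostring(fragment, encoding="unicode")
print(result)
```

The a and c tags are gone, while d and b survive with their text intact.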
