How do I get the whole content between two xml tags in Python?

I am trying to get the whole content between an opening XML tag and its closing counterpart.

Getting the content in simple cases like title below is easy, but how can I get the whole content between the tags when mixed content is used and I want to preserve the inner tags?

<?xml version="1.0" encoding="UTF-8"?>
<review>
  <title>Some testing stuff</title>
  <text sometimes="attribute">Some text with <extradata>data</extradata> in it.
  It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag> 
  or more</sometag>.</text>
</review>

What I want is the content between the two text tags, including any tags: Some text with <extradata>data</extradata> in it. It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag> or more</sometag>.

For now I use regular expressions, but it gets kind of messy and I don't like this approach. I lean towards an XML-parser-based solution. I looked over minidom, etree, lxml and BeautifulSoup but couldn't find a solution for this case (whole content, including inner tags).

How do I get the contents of an XML file in Python?

To read an XML file using ElementTree, first import the xml.etree.ElementTree module under the name ET (a common convention). Then pass the filename of the XML file to ET.parse() to parse it, and get the root element (the top-level tag) of the document with getroot().

How do I parse an XML string in Python?

Use the ElementTree.fromstring() function to parse an XML string. It returns the root Element directly, a subtle difference from ElementTree.parse(), which returns an ElementTree object.
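
As a rough illustration of that difference, here is a minimal sketch using only the standard library. The sample string and the use of io.StringIO as a stand-in for a real file are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

XML_TEXT = "<review><title>Some testing stuff</title></review>"

# fromstring() parses a string and returns the root Element directly.
root = ET.fromstring(XML_TEXT)
print(type(root).__name__, root.tag)

# parse() reads from a filename or file object and returns an ElementTree;
# getroot() is needed to reach the root Element.
tree = ET.parse(io.StringIO(XML_TEXT))
print(type(tree).__name__, tree.getroot().tag)
```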

What is XML etree ElementTree?

The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data. Changed in version 3.3: This module will use a fast implementation whenever available.

Here's something that works for me and your sample:

from lxml import etree
doc = etree.XML(
"""<?xml version="1.0" encoding="UTF-8"?>
<review>
  <title>Some testing stuff</title>
  <text>Some text with <extradata>data</extradata> in it.</text>
</review>"""
)

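def flatten(seq):
  # Join the text nodes and child elements returned by xpath() into one string.
  r = []
  for item in seq:
    if isinstance(item,(str,unicode)):
      # Text node: keep its content as-is.
      r.append(unicode(item))
    elif isinstance(item,(etree._Element,)):
      # Element: serialize it without its tail, which xpath() returns separately.
      r.append(etree.tostring(item,with_tail=False))
  return u"".join(r)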

print flatten(doc.xpath('/review/text/node()'))

Yields:

Some text with <extradata>data</extradata> in it.

The XPath expression selects all child nodes of the <text> element. Each node is either rendered to unicode directly, if it is a string/unicode subclass (<class 'lxml.etree._ElementStringResult'>), or passed to etree.tostring if it is an Element. Passing with_tail=False avoids duplicating the tail.

You may need to handle other node types if they are present.
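
For readers on Python 3 without lxml, the same idea can be sketched with the standard library's xml.etree.ElementTree. This is not the code from the answer above: ElementTree has no node() XPath support, so the sketch reads the element's .text and then serializes each child (ET.tostring includes the child's tail). The helper name inner_xml is made up for this example, and the leading text is re-escaped with xml.sax.saxutils.escape so that characters like & stay valid markup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def inner_xml(elem):
    """Return the markup between elem's opening and closing tags."""
    # Leading text is stored unescaped on .text, so re-escape it.
    parts = [escape(elem.text or "")]
    for child in elem:
        # tostring() serializes the child and its trailing text (.tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

doc = ET.fromstring(
    "<review><title>Some testing stuff</title>"
    '<text sometimes="attribute">Some text with '
    "<extradata>data</extradata> in it.</text></review>"
)
print(inner_xml(doc.find("text")))
```

The attribute on the opening tag does not matter here, because the outer tag itself is never serialized.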

from lxml import etree
t = etree.XML(
"""<?xml version="1.0" encoding="UTF-8"?>
<review>
  <title>Some testing stuff</title>
  <text>Some text with <extradata>data</extradata> in it.</text>
</review>"""
)
(t.text + ''.join(map(etree.tostring, t))).strip()

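The trick here is that t is iterable, and when iterated, yields all of its child elements. Because etree does not represent text as separate nodes, you also need to recover the text before the first child tag, with t.text. The text that follows each child (its tail) is included by etree.tostring.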

In [50]: (t.text + ''.join(map(etree.tostring, t))).strip()
Out[50]: '<title>Some testing stuff</title>\n  <text>Some text with <extradata>data</extradata> in it.</text>'

Or:

In [6]: e = t.xpath('//text')[0]

In [7]: (e.text + ''.join(map(etree.tostring, e))).strip()
Out[7]: 'Some text with <extradata>data</extradata> in it.'
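
To see why both e.text and the children are needed, this small sketch (standard-library ElementTree on Python 3, with assumed sample data) prints where the parser stores each piece of text. It also shows a case the one-liner above does not cover: .text is None when an element starts directly with a child.

```python
import xml.etree.ElementTree as ET

e = ET.fromstring(
    "<text>Some text with <extradata>data</extradata> in it.</text>"
)

print(repr(e.text))      # text before the first child
for child in e:          # iteration yields child elements only
    print(child.tag, repr(child.text), repr(child.tail))

# An element with no leading text has .text set to None, so guard the
# concatenation instead of writing e.text + ... directly.
empty = ET.fromstring("<text><extradata>data</extradata></text>")
leading = empty.text or ""
print(repr(leading))
```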

That is fairly easy with lxml*, using the parse() and tostring() functions:

from lxml.etree import parse, tostring

First you parse the doc and get your element (I am using XPath, but you can use whatever you want):

doc = parse('test.xml')
element = doc.xpath('//text')[0]

The tostring() function returns a text representation of your element:

>>> tostring(element)
'<text>Some <text>text</text> with <extradata>data</extradata> in it.</text>\n'

However, you do not want the outer tags, so we can remove them with a simple str.replace() call:

>>> tostring(element).replace('<%s>'%element.tag, '', 1)
'Some <text>text</text> with <extradata>data</extradata> in it.</text>\n'

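Note that str.replace() received 1 as the third parameter, so it removes only the first occurrence of the opening tag. The closing tag can be handled in a similar way. This time we pass -1 instead of 1; be aware that a count of -1 replaces every occurrence, so the inner </text> is removed along with the outer one: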

>>> tostring(element).replace('</%s>'%element.tag, '', -1)
'<text>Some <text>text with <extradata>data</extradata> in it.\n'

Both replacements can be chained into one expression, although the -1 count still strips the inner </text> as well:

>>> tostring(element).replace('<%s>'%element.tag, '', 1).replace('</%s>'%element.tag, '', -1)
'Some <text>text with <extradata>data</extradata> in it.\n'
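
Because a count of -1 makes str.replace() replace every occurrence, the chained call above also drops the inner </text>. One possible way to remove only the last closing tag is str.rpartition(). The helper name strip_outer_tags is hypothetical, and like the replace() version it assumes the opening tag carries no attributes.

```python
def strip_outer_tags(serialized, tag):
    """Remove one leading <tag> and one trailing </tag> from a string.

    Assumes the opening tag has no attributes.
    """
    opening, closing = "<%s>" % tag, "</%s>" % tag
    body = serialized.replace(opening, "", 1)
    # rpartition() splits at the LAST occurrence of the closing tag.
    head, sep, tail = body.rpartition(closing)
    return head + tail if sep else body

s = "<text>Some <text>text</text> with <extradata>data</extradata> in it.</text>\n"
print(repr(strip_outer_tags(s, "text")))
```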

EDIT: @Charles made a good point: this code is fragile since the tag can have attributes. A possible yet still limited solution is to split the string at the first >:

>>> tostring(element).split('>', 1)
['<text',
 'Some <text>text</text> with <extradata>data</extradata> in it.</text>\n']

then take the second resulting string:

>>> tostring(element).split('>', 1)[1]
'Some <text>text</text> with <extradata>data</extradata> in it.</text>\n'

then rsplit it at the last closing tag:

>>> tostring(element).split('>', 1)[1].rsplit('</', 1)
['Some <text>text</text> with <extradata>data</extradata> in it.', 'text>\n']

and finally take the first result:

>>> tostring(element).split('>', 1)[1].rsplit('</', 1)[0]
'Some <text>text</text> with <extradata>data</extradata> in it.'

Nonetheless, this code is still fragile, since > is a perfectly valid character in XML, even inside attribute values.
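
A small illustration of that fragility, using a hand-written string rather than real tostring() output (the attribute name note is invented for the example). The split lands inside the attribute value, while a parser reads the same string without trouble. The sketch uses the standard library's ElementTree on Python 3.

```python
import xml.etree.ElementTree as ET

# A hand-written element whose attribute value contains a literal '>'.
raw = '<text note="a > b">Some <b>bold</b> text</text>'

# The split-based approach cuts at the '>' inside the attribute value.
naive = raw.split(">", 1)[1].rsplit("</", 1)[0]
print(repr(naive))

# A parser handles the same input correctly.
elem = ET.fromstring(raw)
print(repr(elem.get("note")), repr(elem.text))
```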

Anyway, I have to acknowledge that MattH's solution is the real, general solution.

* Actually, this solution works with ElementTree, too, which is great if you do not want to depend on lxml. The main difference is that ElementTree has no xpath() method; its find() and findall() accept only a limited subset of XPath.

I like @Marcin's solution above; however, I found that his second option (converting a sub-node, not the root of the tree) does not handle entities.

His code from above (modified to add an entity):

from lxml import etree
t = etree.XML("""<?xml version="1.0" encoding="UTF-8"?>
<review>
  <title>Some testing stuff</title>
    <text>this &amp; that.</text>
</review>""")
e = t.xpath('//text')[0]
print (e.text + ''.join(map(etree.tostring, e))).strip()

returns:

this & that.

with a bare/unescaped '&' character instead of a proper entity ('&amp;').

My solution was to call etree.tostring on the node itself (instead of on each of its children), then strip off the opening and closing tags using a regular expression:

import re
from lxml import etree
t = etree.XML("""<?xml version="1.0" encoding="UTF-8"?>
<review>
  <title>Some testing stuff</title>
    <text>this &amp; that.</text>
</review>""")

e = t.xpath('//text')[0]
xml = etree.tostring(e)
inner = re.match(r'<[^>]*?>(.*)</[^>]*>\s*$', xml, flags=re.DOTALL).group(1)
print inner

produces:

this &amp; that.

I used re.DOTALL to ensure this works for XML containing newlines.
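
A Python 3 variant of the same regular-expression idea, sketched with the standard library's ElementTree instead of lxml (an assumption made so the example has no third-party dependency). inner_xml_regex is a hypothetical helper name. It also guards against the case where the pattern does not match, which would otherwise raise an AttributeError on .group(1).

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as above: first tag, greedy body, last closing tag.
OUTER_TAGS = re.compile(r"<[^>]*?>(.*)</[^>]*>\s*$", re.DOTALL)

def inner_xml_regex(elem):
    # encoding="unicode" makes tostring() return str rather than bytes.
    xml = ET.tostring(elem, encoding="unicode")
    match = OUTER_TAGS.match(xml)
    # A self-closing element such as <text/> has no closing tag to match.
    return match.group(1) if match else ""

root = ET.fromstring(
    "<review><title>Some testing stuff</title>\n"
    "<text>this &amp; that.</text>\n</review>"
)
print(inner_xml_regex(root.find("text")))
```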

Tags: python, xml, xml-parsing, lxml

Asked by Brutus. 4 Answers: MattH, Marcin, brandizzi, jdhildeb
