
Extract text between tags with XPath including markup

Tags:

python

xpath

I have the following piece of XML:

...<span class="st">In Tim <em>Power</em>: Politieman...</span>...

I want to extract the part between the <span> tags. For this I use XPath:

   /span[@class="st"]

This, however, will extract the whole element, including the <span> tags themselves, and

  /span[@class="st"]/text()

will return a list of two text nodes: one containing "In Tim " and the other ": Politieman...". The <em>..</em> element is not included; it is treated like a separator.

Is there a pure XPath solution which returns:

In Tim <em>Power</em>: Politieman...

EDIT: Thanks to @helderdarocha and @TextGeek. It seems to be non-trivial to extract the text together with the <em> markup using XPath alone.

The /span[@class="st"]/node() solution returns a list of the individual nodes, from which it is trivial to build a string in Python.
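As a rough illustration of that last step, here is a minimal sketch using only the standard library. Note that xml.etree.ElementTree supports only a subset of XPath (no node() or text()), so the sketch walks the children through the ElementTree API instead of an XPath query. The helper name inner_markup is made up for this example, and the fragment is parsed on its own so that <span> is the root.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the question, parsed on its own so <span> is the root.
span = ET.fromstring('<span class="st">In Tim <em>Power</em>: Politieman...</span>')


def inner_markup(element):
    """Serialize an element's content (text and child elements) without its own tags.

    Hypothetical helper for illustration; not part of any library.
    """
    parts = [element.text or ""]
    for child in element:
        # tostring() emits the child's markup followed by its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


result = inner_markup(span)
print(result)
```

With lxml, the list returned by the node() query could probably be joined in a similar way, serializing element items and keeping string items as they are.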

Pullie asked Jun 02 '14 20:06


2 Answers

Sounds like you want the equivalent of the JavaScript DOM innerHTML property, but for XML. I don't think there's a way to do that in pure XPath.

XPath doesn't operate on markup strings like "<em>" and "</em>" at all; it works with a tree of node objects (there might be an XPath implementation that tries to work directly off markup, but I doubt it). Most XPath implementations wouldn't even have the four characters "<em>" stored anywhere (except maybe kept around for printing error messages), and of course the tree could have been built from scratch rather than parsed from XML or other input in the first place. Likewise, XPath isn't designed to hand back marked-up strings; it returns lists of nodes.

In XSLT or XQuery you can do this easily, but not in XPath by itself, unless I'm missing something.

-s

TextGeek answered Sep 23 '22 17:09


To get all child nodes, you can use:

/span[@class="st"]/node()

This will return:

  1. Two child text nodes
  2. The full <em> node (element and contents).
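To make that node list concrete, the sketch below reproduces it in Python with the standard library. ElementTree's limited XPath has no node() test, so the children are enumerated by hand from text, child elements, and tail attributes; child_nodes is a hypothetical helper name, and the sample fragment is assumed to be parsed with <span> as the root.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the question, parsed on its own so <span> is the root.
span = ET.fromstring('<span class="st">In Tim <em>Power</em>: Politieman...</span>')


def child_nodes(element):
    """List (kind, value) pairs for an element's direct children, in document order.

    Hypothetical helper for illustration; not part of any library.
    """
    nodes = []
    if element.text:
        # Text before the first child element.
        nodes.append(("text", element.text))
    for child in element:
        nodes.append(("element", child.tag))
        if child.tail:
            # Text that follows this child, still inside the parent.
            nodes.append(("text", child.tail))
    return nodes


nodes = child_nodes(span)
print(nodes)
# [('text', 'In Tim '), ('element', 'em'), ('text', ': Politieman...')]
```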

If you actually want all the text() nodes, including the ones inside em, then get all the text() descendants:

/span[@class="st"]//text()

or

/span[@class="st"]/descendant::text()

This will return three text nodes, including the text inside <em>, but not the <em> element itself.
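A rough Python counterpart, assuming the standard library's xml.etree.ElementTree: its XPath subset does not support text(), but Element.itertext() walks the same descendant text in document order. The fragment is parsed on its own here so that <span> is the root.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the question, parsed on its own so <span> is the root.
span = ET.fromstring('<span class="st">In Tim <em>Power</em>: Politieman...</span>')

# itertext() yields every piece of descendant text in document order,
# including the text inside <em>, but none of the markup.
texts = list(span.itertext())
print(texts)            # ['In Tim ', 'Power', ': Politieman...']

plain = "".join(texts)
print(plain)            # In Tim Power: Politieman...
```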

helderdarocha answered Sep 22 '22 17:09