
XPath expression for selecting all text in a given node, and the text of its children

Tags:

xpath

Basically I need to scrape some text that has nested tags.

Something like this:

<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>

And I want an expression that will produce this:

This is an example bolded text

I have been struggling with this for an hour or more with no result.

Any help is appreciated.

Martin Taleski asked May 03 '12 02:05




2 Answers

If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:

import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

selector = scrapy.Selector(text=txt, type="html")  # Build an HTML document from the string
# Collect every text node under the div, in document order
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall()
final_txt = ''.join(all_txt).strip()  # Join the pieces and trim the outer newlines
print(final_txt)  # 'This is an example bolded text'
jerrymouse answered Oct 03 '22 04:10


The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

You want to call the XPath string() function on the div element.

string(//div[@id='theNode'])
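
To make the string-value definition concrete, here is a minimal Python sketch that mimics it outside of an XPath engine. It assumes the sample markup from the question is well-formed XML and uses the standard library's xml.etree.ElementTree, which does not support the XPath string() function itself; the helper name string_value is hypothetical and only illustrates "concatenate all descendant text nodes in document order".

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (assumed to be well-formed XML).
SOURCE = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""


def string_value(element):
    """Concatenate every descendant text node in document order.

    Hypothetical helper that imitates what XPath's string() returns
    for an element node; it is not an XPath evaluation.
    """
    return "".join(element.itertext())


root = ET.fromstring(SOURCE)
value = string_value(root)
print(repr(value))
```

Note that the result keeps the newlines that surround the text inside the div, which is why trimming or normalizing whitespace is usually the next step.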

You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. It removes leading and trailing whitespace and replaces each sequence of whitespace characters with a single space. When you pass a node-set to normalize-space(), the node-set is first converted to its string-value (in XPath 1.0, the string-value of the first node in document order). If no argument is passed, normalize-space() uses the context node.

normalize-space(//div[@id='theNode'])

If theNode were the context node, you could use this instead:
normalize-space()
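
As a rough illustration of what normalize-space() does to the string-value, the following Python sketch reimplements the rule by hand. It is an approximation under stated assumptions: the helper normalize_space is hypothetical, it treats only space, tab, carriage return and newline as whitespace (the characters XPath 1.0 counts as whitespace), and it uses xml.etree.ElementTree rather than a real XPath engine.

```python
import re
import xml.etree.ElementTree as ET

# Sample markup from the question (assumed to be well-formed XML).
SOURCE = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

# Characters treated as whitespace in this sketch.
XPATH_WHITESPACE = " \t\r\n"


def normalize_space(text):
    """Trim leading/trailing whitespace and collapse inner runs to one space.

    Hypothetical helper imitating the XPath normalize-space() function.
    """
    stripped = text.strip(XPATH_WHITESPACE)
    return re.sub(r"[ \t\r\n]+", " ", stripped)


root = ET.fromstring(SOURCE)
raw = "".join(root.itertext())   # string-value of the div, newlines included
normalized = normalize_space(raw)
print(normalized)
```

With a library that evaluates XPath functions directly, the expression normalize-space(//div[@id='theNode']) would be the simpler choice; the sketch only shows the effect of the function.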

You might want to use a more efficient way of selecting the context node than the example XPath I have been using. For example, the following JavaScript can be run against this page in some browsers.

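var el = document.getElementById('question');
// Pass the result type explicitly; older browsers require all five arguments.
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;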

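The whitespace-only text node between the span and b elements might be a problem: it supplies the space between "example" and "bolded", so if a parser or preprocessing step drops whitespace-only text nodes, the two words run together.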
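
The effect of that whitespace-only text node can be seen in a small Python sketch. It simulates a parser that discards whitespace-only text nodes by clearing them manually; the helper drop_whitespace_only_text is hypothetical, and xml.etree.ElementTree itself keeps these nodes by default.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (assumed to be well-formed XML).
SOURCE = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""


def drop_whitespace_only_text(element):
    """Remove text nodes that contain nothing but whitespace.

    Hypothetical helper that simulates a whitespace-stripping parser.
    """
    for node in element.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None


root = ET.fromstring(SOURCE)
before = "".join(root.itertext()).strip()

drop_whitespace_only_text(root)
after = "".join(root.itertext()).strip()

print(before)  # the single space between </span> and <b> is preserved
print(after)   # the space is gone, so the two words are joined
```

If the words come out joined in practice, checking whether the parser is configured to strip ignorable whitespace is a reasonable first step.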

Lachlan Roche answered Oct 03 '22 04:10