XPath expression for selecting all text in a given node, and the text of its children

Tags:

xpath

Basically I need to scrape some text that has nested tags.

Something like this:

<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>

And I want an expression that will produce this:

This is an example bolded text

I have been struggling with this for an hour or more with no result.

Any help is appreciated.

662

asked May 03 '12 02:05

Martin Taleski

2 Answers

If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:

import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

# Build a selector from the HTML text
selector = scrapy.Selector(text=txt, type="html")
# Collect every text node under the div, in document order
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall()
final_txt = ''.join(all_txt).strip()
print(final_txt)  # This is an example bolded text

answered Oct 03 '22 04:10

jerrymouse

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

You want to call the XPath string() function on the div element.

string(//div[@id='theNode'])
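
As a rough illustration of what that string-value contains, here is a minimal Python sketch using only the standard library. It does not call an XPath engine: it assumes the snippet from the question parses as well-formed XML (real-world HTML often does not) and relies on ElementTree's itertext(), which walks the same text nodes in document order. The names SNIPPET and string_value are made up for this example.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it happens to be well-formed XML.
SNIPPET = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

div = ET.fromstring(SNIPPET)

# itertext() yields the text of the element and of all its descendants
# in document order; joining the pieces mirrors the string-value.
string_value = "".join(div.itertext())

print(repr(string_value))
```

Note that the newlines just inside the div are still part of the result; nothing has been trimmed at this point.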

You can also use the normalize-space() function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. It removes leading and trailing whitespace and replaces each sequence of whitespace characters with a single space. When you pass a node-set to normalize-space(), the node-set is first converted to its string-value, which is the string-value of the first node in document order. If no argument is passed, normalize-space() uses the context node.

normalize-space(//div[@id='theNode'])

// if theNode was the context node, you could use this instead
normalize-space()
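
If the text has already been extracted and you only need the same clean-up outside XPath, the behaviour described above can be approximated in Python. This is a sketch, not a library function: normalize_space is a hypothetical helper name, and it treats only space, tab, carriage return and line feed as whitespace, as XPath 1.0 does.

```python
import re


def normalize_space(value: str) -> str:
    """Approximate XPath normalize-space() on an already-extracted string."""
    # Collapse each run of XPath whitespace (space, tab, CR, LF) to one space.
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    # Drop the single leading/trailing space that may remain.
    return collapsed.strip(" ")


raw = "\nThis is an example bolded text\n"
print(normalize_space(raw))  # This is an example bolded text
```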

You might want to use a more efficient way of selecting the context node than the example XPath I have been using. For example, the following JavaScript can be run against this page in some browsers.

var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;

The whitespace-only text node between the span and b elements might be a problem.

130

answered Oct 03 '22 04:10

Lachlan Roche
