
Get text content of an HTML element using XPath?

Tags:

html

xml

html-parsing

xpath

See this HTML:

<div>
    <p>
    <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
</div>
<div>
    <p>
    <span class="abc">Keyboard</span> $20
    </p>
    <a href="/add">Add to cart</a>
</div>

Using XPath, I want to extract Monitor $300 and Keyboard $20. I use this XPath expression:

 //div[a[contains(., "Add to cart")]]/p/text()

But it selects <span class="abc">Monitor</span> <b>$300</b>. I don't want the tags. How do I get only the text?

538

asked Jan 31 '13 17:01

Genghis Khan

1 Answer

You want to select all descendant text, not just child text:

//div[a[contains(., "Add to cart")]]/p//text()

Note the double slash between p and text() there.
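
To make the difference concrete, here is a small sketch that runs both expressions against a compacted copy of the sample markup. The names `doc`, `base`, `child_text` and `descendant_text` are only illustrative, and it assumes lxml is installed:

```python
import lxml.etree as ET

# Compact copy of the markup from the question, wrapped in one root element.
doc = ET.fromstring(
    '<div>'
    '<div><p><span class="abc">Monitor</span> <b>$300</b></p>'
    '<a href="/add">Add to cart</a></div>'
    '<div><p><span class="abc">Keyboard</span> $20</p>'
    '<a href="/add">Add to cart</a></div>'
    '</div>'
)

base = '//div[a[contains(., "Add to cart")]]/p'

# /text() matches only text nodes that are direct children of <p>.
child_text = doc.xpath(base + '/text()')

# //text() matches text nodes at any depth below <p>, including the ones
# inside <span> and <b>.
descendant_text = doc.xpath(base + '//text()')

print(child_text)
print(descendant_text)
```

Only the descendant query returns 'Monitor', 'Keyboard' and '$300', because those strings live inside the span and b elements rather than directly under p.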

This will potentially also include a lot of inter-tag whitespace, though, so you'll need to clean that up. Example using lxml:

>>> import lxml.etree as ET
>>> tree = ET.fromstring('''<div>
... <div>
...     <p>
...     <span class="abc">Monitor</span> <b>$300</b>
...     </p>
...     <a href="/add">Add to cart</a>
... </div>
... <div>
...     <p>
...     <span class="abc">Keyboard</span> $20 
...     </p>
...     <a href="/add">Add to cart</a>
... </div>
... </div>''')
>>> tree.xpath('//div[a[contains(., "Add to cart")]]/p//text()')
['\n    ', 'Monitor', ' ', '$300', '\n    ', '\n    ', 'Keyboard', ' $20 \n    ']
>>> res = _
>>> [txt for txt in (txt.strip() for txt in res) if txt]
['Monitor', '$300', 'Keyboard', '$20']
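
If the goal is one string per product, such as Monitor $300, one option is to select the p elements themselves and evaluate XPath's normalize-space() on each of them. This is a sketch of that alternative rather than the only way to do it. The names `doc`, `paragraphs` and `items` are illustrative, and lxml is again assumed to be installed:

```python
import lxml.etree as ET

doc = ET.fromstring('''<div>
<div>
    <p>
    <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
</div>
<div>
    <p>
    <span class="abc">Keyboard</span> $20
    </p>
    <a href="/add">Add to cart</a>
</div>
</div>''')

# Select the <p> elements themselves rather than their text nodes.
paragraphs = doc.xpath('//div[a[contains(., "Add to cart")]]/p')

# normalize-space(.) takes the element's full string value, trims both ends,
# and collapses each internal run of whitespace to a single space.
items = [p.xpath('normalize-space(.)') for p in paragraphs]

print(items)
```

This keeps each product's name and price together, whereas the flat list from `//text()` loses track of which text node came from which p.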

130

answered Sep 22 '22 19:09

Martijn Pieters
