
Getting non-contiguous text with lxml / ElementTree

Suppose I have this sort of HTML from which I need to select "text2" using lxml / ElementTree:

<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>

If I already have the div element as mydiv, then mydiv.text returns just "text1".

Using itertext() seems problematic or cumbersome at best since it walks the entire tree under the div.

Is there any simple/elegant way to extract a non-first text chunk from an element?

GJ. asked Nov 29 '22 11:11



1 Answer

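lxml.etree provides full XPath support, which lets you address the text nodes that sit directly under an element (this needs lxml; the standard library's ElementTree has no xpath() method):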

>>> import lxml.etree
>>> fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
>>> div = lxml.etree.fromstring(fragment)
>>> div.xpath('./text()')
['text1', 'text2', 'text3']
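
If you would rather not depend on XPath, the same chunks are also reachable through the .text / .tail attributes: text that follows a child element is stored as that child's tail, so "text2" is the tail of the first span. Below is a minimal sketch using the standard library's ElementTree (the attributes behave the same way in lxml). The helper name direct_text_chunks is made up for illustration and is not part of either library.

```python
import xml.etree.ElementTree as ET

fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
div = ET.fromstring(fragment)


def direct_text_chunks(element):
    """Hypothetical helper: text chunks owned directly by `element`, in document order.

    The first chunk is element.text; every later chunk is the .tail of one of
    the element's direct children. Nested text (e.g. "childtext1") is skipped.
    """
    chunks = [element.text]
    chunks.extend(child.tail for child in element)
    # Drop missing/empty chunks (text and tail are None when no text is present).
    return [chunk for chunk in chunks if chunk]


# "text2" comes right after the first <span>, so it is that span's tail.
second_chunk = div[0].tail
all_chunks = direct_text_chunks(div)

print(second_chunk)
print(all_chunks)
```

Unlike itertext(), this only looks at the div and its direct children, so it does not walk the whole subtree. With lxml you can also index the XPath result, e.g. div.xpath('./text()')[1], to pick out a single chunk.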
Shane Holloway answered Dec 06 '22 10:12
