Getting non-contiguous text with lxml / ElementTree

Question

Suppose I have this sort of HTML from which I need to select "text2" using lxml / ElementTree:

<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>

If I already have the div element as mydiv, then mydiv.text returns just "text1".

Using itertext() seems problematic or cumbersome at best since it walks the entire tree under the div.

Is there any simple/elegant way to extract a non-first text chunk from an element?

Shane Holloway · Accepted Answer

lxml.etree provides full XPath 1.0 support, which lets you address the text nodes that are direct children of the div:

>>> import lxml.etree
>>> fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
>>> div = lxml.etree.fromstring(fragment)
>>> div.xpath('./text()')
['text1', 'text2', 'text3']
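
Since the question also mentions ElementTree, here is a minimal sketch of the same idea without XPath. It assumes the standard ElementTree data model, in which text before the first child is stored on the parent's .text and text following each child is stored on that child's .tail. The names chunks and text2 are only illustrative.

```python
import xml.etree.ElementTree as ET

fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
div = ET.fromstring(fragment)

# Text before the first child is stored on the parent element itself.
chunks = [div.text]

# Text that follows each direct child is stored on that child's .tail.
# Iterating an element yields only its direct children, not the whole subtree.
chunks.extend(child.tail for child in div)

# .text and .tail are None when there is no text at that position.
chunks = [chunk for chunk in chunks if chunk]

# "text2" is the tail of the first <span>.
text2 = div[0].tail

print(chunks)
print(text2)
```

The same .text and .tail attributes exist on lxml elements, so this should carry over unchanged. With the XPath result shown above, "text2" is simply item [1] of the returned list.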

Tags: python, html-parsing, lxml, elementtree

GJ.
