
How to get text for a root element using lxml?

Tags: python, lxml

I'm completely stumped as to why lxml's .text gives me the text for a child tag but not for the root tag.

from lxml import etree

some_tag = etree.fromstring('<some_tag class="abc"><strong>Hello</strong> World</some_tag>')

some_tag.find("strong")
Out[195]: <Element strong at 0x7427d00>

some_tag.find("strong").text
Out[196]: 'Hello'

some_tag
Out[197]: <Element some_tag at 0x7bee508>

some_tag.text

some_tag.find("strong").text returns the text inside the <strong> element.

I expected some_tag.text to return everything between <some_tag> and </some_tag>.

Expected:

<strong>Hello</strong> World

Instead, it returns nothing.

Jason Wirth asked Apr 21 '12 11:04



1 Answer

from lxml import etree

XML = '<some_tag class="abc"><strong>Hello</strong> World</some_tag>'

some_tag = etree.fromstring(XML)

# Iterating over an element yields its direct children.
for element in some_tag:
    print(element.tag, element.text, element.tail)

Output:

strong Hello  World

For information on the .text and .tail properties, see:

  • http://lxml.de/tutorial.html#elements-contain-text
  • http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html
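As a minimal sketch of that text model, using the markup from the question: an element's .text holds only the text before its first child, and text that follows a child's closing tag is stored on that child as .tail. The root here has no text before <strong>, which is why some_tag.text is None. The variable names are illustrative only.

```python
from lxml import etree

root = etree.fromstring('<some_tag class="abc"><strong>Hello</strong> World</some_tag>')
strong = root.find("strong")

# .text covers only the text before the first child; the root has none here.
print(repr(root.text))    # None
print(repr(strong.text))  # 'Hello'

# Text after </strong> belongs to the <strong> element as its tail.
print(repr(strong.tail))  # ' World'

# itertext() yields every text and tail in document order, without the tags.
print("".join(root.itertext()))  # Hello World
```

If only the plain text is needed and the markup can be dropped, joining itertext() as above is usually enough.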

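To get exactly the result you expected, serialize the child element. By default, tostring() includes the element's tail text, and encoding="unicode" makes it return a str rather than bytes:

print(etree.tostring(some_tag.find("strong"), encoding="unicode"))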

Output:

<strong>Hello</strong> World
mzjn answered Oct 05 '22 20:10