
Incredibly basic lxml questions: getting HTML/string content of lxml.etree._Element?

Tags: python, lxml

This is such a basic question that I actually can't find it in the docs :-/

In the following:

img = house_tree.xpath('//img[@id="mainphoto"]')[0] 

How do I get the HTML of the <img/> tag?

I've tried adding html_content() but get AttributeError: 'lxml.etree._Element' object has no attribute 'html_content'.

Also, if it were a tag with some content inside (e.g. <p>text</p>), how would I get the content (e.g. text)?

Many thanks!

AP257 asked Mar 22 '11 18:03



1 Answer

I suppose it will be as simple as:

from lxml.etree import tostring
# tostring() serializes the element itself (its "outer" HTML) and returns bytes on Python 3
img_html = tostring(img)
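To make that concrete, here is a minimal sketch of serializing the <img> element. The markup and the file name house.jpg are made up to stand in for whatever house_tree was parsed from, and the keyword arguments are one reasonable choice rather than the only one:

```python
from lxml import etree

# Hypothetical markup standing in for the page behind house_tree.
markup = (
    '<html><body><div>'
    '<img id="mainphoto" src="house.jpg">caption text'
    '</div></body></html>'
)
house_tree = etree.HTML(markup)  # parse with lxml's HTML parser

img = house_tree.xpath('//img[@id="mainphoto"]')[0]

# method="html" serializes with HTML rules instead of XML rules,
# encoding="unicode" returns a str instead of bytes,
# with_tail=False leaves out the text that follows the element.
img_html = etree.tostring(img, method="html", encoding="unicode", with_tail=False)
print(img_html)
```

Without with_tail=False, the text that follows the tag ("caption text" here) would be appended to the result, which is usually not what you want when you only need the tag.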

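As for getting the content from inside an element such as <p>, say some selected element el:

content = el.text

That gives the text directly inside the element, up to its first child tag. Note that text_content() exists only on elements parsed with lxml.html, so on a plain lxml.etree._Element it raises AttributeError. If the tree was built with lxml.html, you can get all nested text in one call:

content = el.text_content()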
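A short sketch comparing the ways of pulling text out of an element. The markup is invented for illustration, and the variable names are placeholders:

```python
from lxml import etree
from lxml import html as lxml_html

# Hypothetical markup; any element with mixed text and child tags behaves the same way.
markup = "<div><p>Hello <b>bold</b> world</p></div>"

# Plain etree element: no text_content() method here.
el = etree.HTML(markup).xpath("//p")[0]
direct_text = el.text               # only the text before the first child tag
all_text = "".join(el.itertext())   # text of the element and all its descendants

# lxml.html element: text_content() is available.
html_el = lxml_html.fromstring(markup).xpath("//p")[0]
html_text = html_el.text_content()

print(repr(direct_text))
print(repr(all_text))
print(repr(html_text))
```

For a simple <p>text</p> all three give the same string. They only differ once the element contains child tags, where el.text stops at the first child.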
vonPetrushev answered Oct 11 '22 23:10