
How to get the source of html in lxml?

Tags: python, lxml

import urllib
import lxml.html
down='http://blog.sina.com.cn/s/blog_71f3890901017hof.html'
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
body=root.xpath('//div[@class="articalContent  "]')[0]
print body.text_content()

When I run this code, I get the text content of the element. How can I get its HTML source instead of the text content?

asked Dec 31 '12 06:12 by Bqsj Sjbq



1 Answer

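Use lxml.html.tostring(), which serializes an element together with its markup:

html = lxml.html.tostring(node)

Here node is the element returned by your XPath query (body in your code). And please: read the basic documentation of the tools you are using first.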

answered Sep 20 '22 16:09 by Andreas Jung