import urllib
import lxml.html
down='http://blog.sina.com.cn/s/blog_71f3890901017hof.html'
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
body=root.xpath('//div[@class="articalContent "]')[0]
print body.text_content()
When I run the code, I get the text content. How can I get the HTML source of the element instead of its text content?
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
lxml is a Python library for handling XML and HTML files, and it can also be used for web scraping. Many off-the-shelf XML parsers exist, though some developers prefer to write their own XML and HTML parsers for specific needs.
Use
html = lxml.html.tostring(node)
and please: read the basic documentation of the tools you are using first.
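To make the difference from text_content() concrete, here is a minimal sketch of the same lookup serialized with lxml.html.tostring. It assumes Python 3 and an installed lxml; SAMPLE_PAGE and the helper element_html are hypothetical names made up for this illustration, and the inline sample stands in for the downloaded page so the snippet runs offline.

```python
import lxml.html

# Hypothetical stand-in for the downloaded blog page (not the real page content).
SAMPLE_PAGE = """
<html><body>
  <div class="articalContent "><p>Hello <b>world</b></p></div>
</body></html>
"""


def element_html(page_source, xpath_expr):
    """Return the serialized HTML of the first element matching xpath_expr."""
    root = lxml.html.document_fromstring(page_source)
    matches = root.xpath(xpath_expr)
    if not matches:
        raise ValueError("no element matches %r" % xpath_expr)
    # encoding="unicode" asks tostring for a text string instead of bytes.
    return lxml.html.tostring(matches[0], encoding="unicode")


body_html = element_html(SAMPLE_PAGE, '//div[@class="articalContent "]')
print(body_html)  # markup of the div, tags included
```

Where text_content() would return only the text inside the div, tostring returns the element's markup, including the nested tags.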