How do I retrieve all the HTML contained inside a tag?
hxs = HtmlXPathSelector(response)
element = hxs.select('//span[@class="title"]')
Perhaps something like:
hxs.select('//span[@class="title"]/html()')
EDIT:
Looking at the documentation, I only see methods that return a new XPathSelectorList or just the raw text inside a tag.
I don't want a new list or just the text; I want the HTML source inside a tag.
e.g.:
<html>
<head>
<title></title>
</head>
<body>
<div id="leexample">
justtext
<p class="ihatelookingforfeatures">
sometext
</p>
<p class="yahc">
sometext
</p>
</div>
<div id="lenot">
blabla
</div>
an awfuly long example for this.
</body>
</html>
I want a call such as hxs.select('//div[@id="leexample"]/html()')
that returns the HTML inside that element, like this:
justtext
<p class="ihatelookingforfeatures">
sometext
</p>
<p class="yahc">
sometext
</p>
I hope that clears up the ambiguity in my question.
How do I get the HTML from an HtmlXPathSelector in Scrapy? (Perhaps with a solution outside Scrapy's scope?)
Call .extract() on your XPathSelectorList. It returns a list of unicode strings containing the HTML content you want.
# Valid: returns one string per child element
hxs.select('//div[@id="leexample"]/*').extract()
# This is wrong: html() is not an XPath function
hxs.select('//div[@id="leexample"]/html()').extract()
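/html() is not valid XPath, so it cannot be used in a Scrapy selector. To extract all child elements, use '//div[@id="leexample"]/*'. That form skips bare text such as "justtext" in your example, so to get every child node use '//div[@id="leexample"]/node()' instead. Note that node() also returns text nodes, so the result looks something like: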
[u'\n ', u'<a href="image1.html">Name: My image 1
' ]
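Since .extract() on a node() query gives back one string per child node, joining those strings gives the inner HTML as a single string. Here is a minimal sketch of that idea. inner_html is a hypothetical helper name (it is not part of Scrapy), and it assumes the hxs argument behaves like the HtmlXPathSelector used above:

```python
def inner_html(hxs, xpath):
    """Return the markup inside the element(s) matched by `xpath`.

    `hxs` is assumed to be an HtmlXPathSelector, or any object exposing the
    same select()/extract() interface. `inner_html` is a hypothetical helper
    name, not a Scrapy function.
    """
    # node() matches every child node: elements, text nodes and comments.
    fragments = hxs.select(xpath + '/node()').extract()
    # extract() returns a list of strings; concatenating them in document
    # order rebuilds the markup that sat between the opening and closing tag.
    return ''.join(fragments)


# Usage inside a spider callback, where `response` is the page being parsed:
#   hxs = HtmlXPathSelector(response)
#   html = inner_html(hxs, '//div[@id="leexample"]')
```

If the XPath matches several elements, the children of all of them are concatenated, so it is safest to pass an expression that matches exactly one element. In newer Scrapy releases the same approach is usually written with response.xpath(...) and .getall() rather than HtmlXPathSelector and .extract().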