How to return result as HTML with HtmlXPathSelector (Scrapy)

How do I retrieve all the HTML contained inside a tag?

from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)
element = hxs.select('//span[@class="title"]')

Perhaps something like:

hxs.select('//span[@class="title"]/html()')

EDIT: Looking at the documentation, I only see methods that return a new XPathSelectorList or the raw text inside a tag. I don't want a new list or just the text; I want the HTML source inside a tag. For example:

<html>
    <head>
        <title></title>
    </head>
    <body>
        <div id="leexample">
            justtext
            <p class="ihatelookingforfeatures">
                sometext
            </p>
            <p class="yahc">
                sometext
            </p>
        </div>
        <div id="lenot">
            blabla
        </div>
    an awfully long example for this.
    </body>
</html>

I would like a method such as hxs.select('//div[@id="leexample"]/html()') that returns the HTML inside the tag, like this:

justtext
<p class="ihatelookingforfeatures">
    sometext
</p>
<p class="yahc">
    sometext
</p>

I hope that clears up the ambiguity in my question.

How to get the HTML from an HtmlXPathSelector in Scrapy? (perhaps a solution outside scrapy's scope?)

mirandalol asked Dec 26 '22 21:12

1 Answer

Call .extract() on your XPathSelectorList. It returns a list of unicode strings containing the HTML you want.

hxs.select('//div[@id="leexample"]/*').extract()

Update

# This is wrong
hxs.select('//div[@id="leexample"]/html()').extract()

/html() is not a valid XPath function, so Scrapy cannot evaluate it. To extract the children, use '//div[@id="leexample"]/*' or '//div[@id="leexample"]/node()'. Note that /* matches only child elements, while node() also matches text nodes, so its result looks something like:

[u'\n   ',
 u'<a href="image1.html">Name: My image 1 ']
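
Since the question also asks about a solution outside Scrapy, here is a minimal sketch that uses only Python's standard-library xml.etree.ElementTree. It assumes the markup is well-formed XML (real-world HTML often is not), and the names PAGE and inner_html are made up for illustration. It shows why taking only the child elements, as /* does, loses the leading "justtext", and how including the text nodes gives the full inner HTML.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample page from the question (illustrative only).
PAGE = (
    '<html><body>'
    '<div id="leexample">justtext'
    '<p class="ihatelookingforfeatures">sometext</p>'
    '<p class="yahc">sometext</p>'
    '</div>'
    '<div id="lenot">blabla</div>'
    '</body></html>'
)


def inner_html(element):
    """Serialize everything inside `element`: its leading text plus every child."""
    # element.text holds the text that appears before the first child element.
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the child's tail text (text after its closing tag),
        # so text between and after children is preserved.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


root = ET.fromstring(PAGE)
div = root.find(".//div[@id='leexample']")

# Child elements only: comparable to selecting with '/*', so "justtext" is missing.
children_only = [ET.tostring(child, encoding="unicode") for child in div]
print(children_only)

# Text plus children: comparable to joining the results of 'node()'.
print(inner_html(div))
```

With Scrapy itself, the comparable approach would be to join the strings returned by .extract() for the node() query.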
xiaowl answered Jan 01 '23 09:01