
Scraping HTML inside JSON with Scrapy

I'm requesting a website whose response is JSON like this:

{
    "success": true,
    "response": "<html>... html goes here ...</html>"
}

I've seen how to scrape HTML and how to scrape JSON, but I haven't found how to scrape HTML that is embedded inside JSON. Is it possible to do this using Scrapy?

Ivan asked Jun 13 '16 14:06


1 Answer

One way is to build a scrapy.Selector out of the HTML inside the JSON data.

I'll assume you have the Response object with JSON data in it, available through response.text.

Below, I'm building a test response to play with (I'm using Scrapy 1.1 with Python 3):

import scrapy

response = scrapy.http.TextResponse(url='http://www.example.com/json', body=r'''
{
    "success": true,
    "response": "<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>"
}
''', encoding='utf8')

Using the json module, you can decode the response and get at the HTML data like this:

import json
data = json.loads(response.text)

You get something like:

>>> data
{'success': True, 'response': "<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>"}
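Before handing the HTML to a selector, it may be worth checking that the payload is usable. The following is a minimal sketch using only the standard library. It assumes that "success" is a flag meaning the request worked (the question does not say so) and that the HTML lives under "response"; the helper name extract_html and its parameters are made up for illustration.

```python
import json


def extract_html(json_text, html_key="response", flag_key="success"):
    """Return the embedded HTML string, or None if the payload is unusable."""
    try:
        data = json.loads(json_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    # Assumption: a falsy or missing flag means there is nothing worth parsing.
    if not isinstance(data, dict) or not data.get(flag_key):
        return None
    html = data.get(html_key)
    return html if isinstance(html, str) else None


ok = extract_html('{"success": true, "response": "<html><title>Hi</title></html>"}')
failed = extract_html('{"success": false, "response": ""}')
broken = extract_html("not json")
```

Here ok holds the HTML string, while failed and broken are both None, so the caller can skip building a selector in those cases.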

Then you can build a new selector like this:

selector = scrapy.Selector(text=data['response'], type="html")

after which you can use XPath or CSS selectors on it:

>>> selector.xpath('//title/text()').extract()
['Example website']
paul trmbrth answered Oct 31 '22 06:10