Scraping HTML inside JSON with Scrapy

Question

I'm requesting a website that responds with JSON like this:

{
    "success": true,
    "response": "<html>... html goes here ...</html>"
}

I've seen how to scrape HTML and how to scrape JSON, but I haven't found how to scrape HTML embedded in JSON. Is it possible to do this using Scrapy?

paul trmbrth · Accepted Answer

One way is to build a scrapy.Selector out of the HTML inside the JSON data.

I'll assume you have the Response object with JSON data in it, available through response.text.

To experiment, I'm first building a test response to play with (I'm using Scrapy 1.1 with Python 3). The body is generated with json.dumps so that the line breaks in the HTML are escaped as \n; raw line breaks inside a JSON string would make json.loads fail:

import json
import scrapy

# HTML that the JSON payload carries in its "response" field
html = """<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>"""

# json.dumps escapes the newlines, so the body is valid JSON
body = json.dumps({"success": True, "response": html})
response = scrapy.http.TextResponse(url='http://www.example.com/json', body=body, encoding='utf8')

Using the json module, you can get the HTML data like this:

import json
data = json.loads(response.text)

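You get something like this (the line breaks inside the string are shown expanded for readability; the real repr prints them as \n):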

>>> data
{'success': True, 'response': "<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>"}
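
Before handing the HTML to a selector, it may be worth checking that the payload really is usable. The following is a minimal sketch, not part of the original answer: extract_html is a hypothetical helper name, and the "success" and "response" keys are taken from the sample payload in the question, so other APIs may well name them differently.

```python
import json


def extract_html(text):
    """Return the embedded HTML string, or None if the payload is unusable.

    Hypothetical helper: the "success" and "response" keys are assumed
    from the sample payload in the question.
    """
    try:
        data = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    # Reject payloads that are not objects or that report a failure
    if not isinstance(data, dict) or not data.get("success"):
        return None
    html = data.get("response")
    # Only hand back the value if it is actually a string
    return html if isinstance(html, str) else None
```

With the test response from this answer, extract_html(response.text) would return the same string as data['response'].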

Then you can build a new selector like this:

selector = scrapy.Selector(text=data['response'], type="html")

after which you can use XPath or CSS selectors on it:

>>> selector.xpath('//title/text()').extract()
['Example website']

Tags:

scrapy

scrapy-spider

Ivan
