I'm requesting a page whose response is JSON like this:
{
"success": true,
"response": "<html>... html goes here ...</html>"
}
I've seen how to scrape HTML and how to scrape JSON, but I haven't found how to scrape HTML embedded inside JSON. Is it possible to do this using Scrapy?
One way is to build a scrapy.Selector out of the HTML inside the JSON data.
I'll assume you have the Response object with JSON data in it, available through response.text.
(Below, I'm building a test response to play with, using Scrapy 1.1 with Python 3:
import scrapy
response = scrapy.http.TextResponse(url='http://www.example.com/json', body=r'''
{
"success": true,
"response": "<html>\n <head>\n <base href='http://example.com/' />\n <title>Example website</title>\n </head>\n <body>\n <div id='images'>\n <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n </div>\n </body>\n</html>"
}
''', encoding='utf8')
)
Using the json module, you can get the HTML data like this:
import json
data = json.loads(response.text)
You get something like this:
>>> data
{'success': True, 'response': "<html>\n <head>\n <base href='http://example.com/' />\n <title>Example website</title>\n </head>\n <body>\n <div id='images'>\n <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n </div>\n </body>\n</html>"}
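If the endpoint can also report a failure, it may be worth checking the "success" flag before touching the HTML. The following is only a minimal sketch: extract_html is a hypothetical helper name, and it assumes that a false or missing "success" value means there is no usable HTML in the payload, which the original response format does not confirm.

```python
import json


def extract_html(json_text):
    """Return the embedded HTML string, or None if the payload reports failure.

    Hypothetical helper: assumes a false or missing "success" flag
    means the "response" field should not be used.
    """
    data = json.loads(json_text)
    if not data.get("success"):
        return None
    return data.get("response")


# Small made-up payloads, shaped like the JSON in the question.
ok = extract_html('{"success": true, "response": "<html><title>t</title></html>"}')
failed = extract_html('{"success": false}')
```

In a spider callback you would pass response.text to it and skip the page when it returns None.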
Then you can build a new selector like this:
selector = scrapy.Selector(text=data['response'], type="html")
After that, you can use XPath or CSS selectors on it:
>>> selector.xpath('//title/text()').extract()
['Example website']