How do you use Scrapy to scrape web requests that return JSON? For example, the JSON would look like this:
{ "firstName": "John", "lastName": "Smith", "age": 25, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021" }, "phoneNumber": [ { "type": "home", "number": "212 555-1234" }, { "type": "fax", "number": "646 555-4567" } ] }
I would be looking to scrape specific items (e.g. name and fax in the above) and save them to CSV.
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
When working with Scrapy, you first need to create a Scrapy project. Inside the project, create a spider to fetch the data: move to the spiders folder and add a Python file there, e.g. gfgfetch.py.
It's the same as using Scrapy's HtmlXPathSelector for HTML responses. The only difference is that you should use the json module to parse the response:

```python
import json

class MySpider(BaseSpider):
    ...  # name, start_urls, etc.

    def parse(self, response):
        jsonresponse = json.loads(response.text)
        item = MyItem()
        item["firstName"] = jsonresponse["firstName"]
        return item
```
Hope that helps.
You don't need the json module to parse the response object: recent Scrapy versions provide Response.json(), which does it for you.

```python
class MySpider(BaseSpider):
    ...  # name, start_urls, etc.

    def parse(self, response):
        jsonresponse = response.json()
        item = MyItem()
        item["firstName"] = jsonresponse.get("firstName", "")
        return item
```
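For the CSV part of the question, in a real spider you would normally just yield the items and let Scrapy's feed exports write the file (e.g. `scrapy crawl myspider -o items.csv`). The CSV step itself is only the stdlib csv module; a sketch with illustrative file and field names:

```python
import csv

# Items as they might be yielded from parse() (illustrative values)
items = [
    {"firstName": "John", "fax": "646 555-4567"},
]

with open("items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["firstName", "fax"])
    writer.writeheader()
    writer.writerows(items)
```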