Scraping a JSON response with Scrapy

How do you use Scrapy to scrape web requests that return JSON? For example, the JSON would look like this:

{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "fax",
            "number": "646 555-4567"
        }
    ]
}

I want to scrape specific items (e.g. the name and the fax number in the above) and save them to CSV.

Thomas Kingaroy asked Aug 11 '13 12:08


2 Answers

It's the same as using Scrapy's HtmlXPathSelector for HTML responses. The only difference is that you should use the json module to parse the response:

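import json

import scrapy

class MySpider(scrapy.Spider):  # BaseSpider is the old, deprecated name for scrapy.Spider
    ...

    def parse(self, response):
        # Decode the JSON body into a Python dict
        jsonresponse = json.loads(response.text)

        item = MyItem()
        item["firstName"] = jsonresponse["firstName"]
        return item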
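To get at the fax number asked about in the question, the nested phoneNumber list has to be searched by type. Here is a minimal sketch using only the standard library and the sample JSON from the question; the helper name extract_contact and the output keys are assumptions made for illustration, not part of Scrapy.

```python
import json

# Sample payload copied from the question.
SAMPLE = """
{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "phoneNumber": [
        {"type": "home", "number": "212 555-1234"},
        {"type": "fax", "number": "646 555-4567"}
    ]
}
"""


def extract_contact(data):
    """Hypothetical helper: pick the name and fax number out of the parsed JSON."""
    # Take the first phone entry whose type is "fax"; fall back to "" if none exists.
    fax = next(
        (
            phone.get("number", "")
            for phone in data.get("phoneNumber", [])
            if phone.get("type") == "fax"
        ),
        "",
    )
    return {
        "firstName": data.get("firstName", ""),
        "lastName": data.get("lastName", ""),
        "fax": fax,
    }


contact = extract_contact(json.loads(SAMPLE))
print(contact)
```

Inside a spider, parse() could return extract_contact(json.loads(response.text)) instead of filling a MyItem by hand, since Scrapy also accepts plain dicts as items.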

Hope that helps.

alecxe answered Oct 01 '22 10:10


You don't need the json module to parse the response object. In Scrapy 2.2 and later, the response has a json() method:

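import scrapy

class MySpider(scrapy.Spider):  # BaseSpider is the old, deprecated name for scrapy.Spider
    ...

    def parse(self, response):
        # response.json() decodes the JSON body into a Python dict
        jsonresponse = response.json()

        item = MyItem()
        # .get() returns "" instead of raising KeyError when the field is missing
        item["firstName"] = jsonresponse.get("firstName", "")
        return item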
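For the "save to csv" part of the question, Scrapy's feed exports can write the returned items straight from the command line, e.g. scrapy crawl myspider -o items.csv (the spider name myspider and the file name items.csv are placeholders). To show the shape of the result, here is a rough standard-library sketch; the function rows_to_csv, the column names, and the fax column are assumptions for illustration only.

```python
import csv
import io

# One row shaped like an item a spider might return (values from the question's sample).
rows = [{"firstName": "John", "lastName": "Smith", "fax": "646 555-4567"}]


def rows_to_csv(rows, fieldnames):
    """Hypothetical helper: serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output identical across platforms.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["firstName", "lastName", "fax"])
print(csv_text)
```

To write a file rather than a string, the same writer can be pointed at open("items.csv", "w", newline="") instead of the StringIO buffer.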

HARVYS 789 answered Oct 01 '22 11:10