
Scrapy, scraping data inside JavaScript


I am using Scrapy to scrape data from a website. However, the data I want isn't in the HTML itself; it is loaded by JavaScript. So, my question is:

How do I get the (text) values in such cases?

This is the site I'm trying to scrape: https://www.mcdonalds.com.sg/locate-us/

Attributes I'm trying to get: Address, Contact, Operating hours.

If you right-click and choose "View source" in Chrome, you will see that these values aren't available in the HTML itself.


Edit

Sorry Paul, I did what you told me to: I found admin-ajax.php and saw the body, but I'm really stuck now.

How do I retrieve the values from the JSON object and store them in fields of my own item? It would be good if you could show how to do just one attribute, for the public and for those who have just started with Scrapy as well.

Here's my code so far

Items.py

    from scrapy.item import Item, Field

    class McDonaldsItem(Item):
        name = Field()
        address = Field()
        postal = Field()
        hours = Field()

McDonalds.py

    import json
    import pprint
    import re

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    from fastfood.items import McDonaldsItem

    class McDonaldSpider(BaseSpider):
        name = "mcdonalds"
        allowed_domains = ["mcdonalds.com.sg"]
        start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

        def parse_json(self, response):
            # Decode the JSON body and print it for inspection.
            js = json.loads(response.body)
            pprint.pprint(js)

Sorry for the long edit. In short, how do I store a JSON value in my item's field? For example:

    item['address'] = <how to retrieve this?>

P.S. Not sure if this helps, but I run the spider from the command line, saving all my data to a JSON file:

    scrapy crawl mcdonalds -o McDonalds.json -t json

I cannot stress enough how thankful I am. I know it's kind of unreasonable to ask this of you, and it's totally okay if you don't have time for this.

HeadAboutToExplode asked Sep 26 '13 07:09





1 Answer

(I posted this to the scrapy-users mailing list, but at Paul's suggestion I'm posting it here, as it complements the answer with the shell command interaction.)

Generally, websites that use a third-party service to render some data visualization (map, table, etc.) have to send the data somehow, and in most cases this data is accessible from the browser.

For this case, an inspection (i.e. exploring the requests made by the browser) shows that the data is loaded from a POST request to https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php

So basically, all the data you want is there in a nice JSON format, ready to consume.

Scrapy provides the shell command, which is very convenient for tinkering with the website before writing the spider:

    $ scrapy shell https://www.mcdonalds.com.sg/locate-us/
    2013-09-27 00:44:14-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: scrapybot)
    ...

    In [1]: from scrapy.http import FormRequest

    In [2]: url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'

    In [3]: payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}

    In [4]: req = FormRequest(url, formdata=payload)

    In [5]: fetch(req)
    2013-09-27 00:45:13-0400 [default] DEBUG: Crawled (200) <POST https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php> (referer: None)
    ...

    In [6]: import json

    In [7]: data = json.loads(response.body)

    In [8]: len(data['stores']['listing'])
    Out[8]: 127

    In [9]: data['stores']['listing'][0]
    Out[9]:
    {u'address': u'678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
     u'city': u'Singapore',
     u'id': 78,
     u'lat': u'1.440409',
     u'lon': u'103.801489',
     u'name': u"McDonald's Admiralty",
     u'op_hours': u'24 hours<br>\r\nDessert Kiosk: 0900-0100',
     u'phone': u'68940513',
     u'region': u'north',
     u'type': [u'24hrs', u'dessert_kiosk'],
     u'zip': u'731678'}
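
If you want to double-check the endpoint outside Scrapy, the same form-encoded POST can be replayed with the Python 3 standard library. This is only a rough sketch: the URL and payload are copied from the shell session above, the helper names `build_request` and `fetch_stores` are made up for illustration, and the site may well have changed since this was written, so the response shape is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# URL and payload copied from the shell session; the site may have changed since.
URL = "https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php"
PAYLOAD = {
    "action": "ws_search_store_location",
    "store_name": "0",
    "store_area": "0",
    "store_type": "0",
}


def build_request(url=URL, payload=PAYLOAD):
    """Build a form-encoded POST request (hypothetical helper; nothing is sent yet)."""
    body = urlencode(payload).encode("ascii")
    # Passing data= makes urllib issue a POST instead of a GET.
    return Request(url, data=body)


def fetch_stores(request):
    """Send the request and return the list under data['stores']['listing'].

    Assumes the response still has the shape shown in the shell session.
    """
    with urlopen(request) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["stores"]["listing"]


req = build_request()
print(req.get_method(), req.data)
```

Calling `fetch_stores(req)` would actually hit the network, so it is left out of the snippet; `build_request()` alone is enough to see what body gets sent.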

In short: in your spider you have to return the FormRequest(...) above, then in the callback load the JSON object from response.body, and finally, for each store's data in the list data['stores']['listing'], create an item with the wanted values.

Something like this:

    import json

    from scrapy.http import FormRequest
    from scrapy.spider import BaseSpider

    from fastfood.items import McDonaldsItem

    class McDonaldSpider(BaseSpider):
        name = "mcdonalds"
        allowed_domains = ["mcdonalds.com.sg"]
        start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

        def parse(self, response):
            # This receives the response from the start url. But we don't do anything with it.
            url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'
            payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}
            return FormRequest(url, formdata=payload, callback=self.parse_stores)

        def parse_stores(self, response):
            # The body is JSON; each entry in data['stores']['listing'] is one store.
            data = json.loads(response.body)
            for store in data['stores']['listing']:
                yield McDonaldsItem(name=store['name'], address=store['address'])
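
The sample record in the shell output has `<br/>` tags inside `address` and `op_hours`, so you will probably want to clean those before storing them. Here is a minimal sketch of one way to map a store record onto the item's fields (name, address, postal, hours). The helpers `clean_text` and `store_to_fields` are hypothetical names, and using `zip` for `postal` and `op_hours` for `hours` is my guess at what the question wants.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case, plus surrounding whitespace.
BR_TAG = re.compile(r"\s*<br\s*/?>\s*", re.IGNORECASE)


def clean_text(value):
    """Replace HTML line breaks with ', ' and trim the result (hypothetical helper)."""
    return BR_TAG.sub(", ", value).strip()


def store_to_fields(store):
    """Map one record from data['stores']['listing'] to the item's field names."""
    return {
        "name": store["name"],
        "address": clean_text(store["address"]),
        "postal": store["zip"],  # assumption: 'zip' is the postal code
        "hours": clean_text(store["op_hours"]),
    }


# Record taken from the shell session output (unused keys omitted).
sample = {
    "name": "McDonald's Admiralty",
    "address": "678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678",
    "op_hours": "24 hours<br>\r\nDessert Kiosk: 0900-0100",
    "zip": "731678",
}

fields = store_to_fields(sample)
print(fields)
```

In the callback you could then write `yield McDonaldsItem(**store_to_fields(store))`. The contact number is under `store['phone']`, but the item in the question has no field for it, so you would have to add one (e.g. `phone = Field()`) first.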
R. Max answered Jan 01 '23 14:01
