
Web scraping with Google Apps Script

I'm trying to pull data from the following sample web page using Google Apps Script:

url = http://www.premierleague.com/players/2064/Wayne-Rooney/stats?se=54

using UrlFetchApp.fetch(url).

The problem is that when I call UrlFetchApp.fetch(url), I don't get the page content selected by the 'se' parameter in the URL. Instead, I get the content of the following URL, apparently because the 'se=54' data is loaded asynchronously: http://www.premierleague.com/players/2064/Wayne-Rooney/stats

Is there another way to pass the 'se' parameter? UrlFetchApp.fetch accepts an 'options' argument, but the documentation on it is very limited.

Any help would be most appreciated. Many thanks

Tommy

Tommy asked Jul 30 '16 17:07




2 Answers

Open the page in your browser and open the developer tools (F12 or Ctrl+Shift+I). Click the Network tab and reload the page with F5. A list of requests will appear; near the bottom you should see the asynchronous requests that fetch the statistics. Those requests retrieve the data as JSON from footballapi.pulselive.com. You can make the same request from Apps Script, but you must send a correct "Origin" header or the request is rejected. Here is an example:

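function fetchData() {
  // JSON endpoint that the page itself calls asynchronously (found via the Network tab).
  var url = "http://footballapi.pulselive.com/football/stats/player/2064?comps=1";
  var options = {
    "headers": {
      // The API rejects requests that lack the expected Origin header.
      "Origin": "http://www.premierleague.com"
    }
  };
  var json = JSON.parse(UrlFetchApp.fetch(url, options).getContentText());
  // Log only the entry for the "goals" statistic.
  for (var i = 0; i < json.stats.length; i++) {
    if (json.stats[i].name === "goals") Logger.log(json.stats[i]);
  }
}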
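
As a possible refinement, the lookup loop above could be factored into a small helper that returns the matching entry instead of logging it. This is only a sketch: it assumes the parsed response has the shape the loop relies on (a `stats` array of objects with a `name` field). The function name `findStat` and the sample object are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: returns the first entry in json.stats whose name matches,
// or null when the stats array is missing or contains no such entry.
function findStat(json, statName) {
  var stats = (json && json.stats) || [];
  for (var i = 0; i < stats.length; i++) {
    if (stats[i].name === statName) return stats[i];
  }
  return null;
}

// Made-up sample data shaped like the response assumed above (values are illustrative only).
var sample = { stats: [{ name: "appearances", value: 1 }, { name: "goals", value: 2 }] };
var goals = findStat(sample, "goals");
```

In fetchData, the parsed `json` object would take the place of `sample`, for example `Logger.log(findStat(json, "goals"))`. Returning null for a missing statistic lets the caller decide how to handle it.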

SpiderPig answered Oct 20 '22 09:10


Try passing an explicit options object to UrlFetchApp.fetch, where url is the page address from the question:

var options =
{
   "method"  : "GET",   
   "followRedirects" : true,
   "muteHttpExceptions": true
};

var result = UrlFetchApp.fetch(url, options);
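
With "muteHttpExceptions" set to true, a failed request returns a response object instead of throwing, so the status code should be checked before the body is used. The sketch below shows one way to do that. The wrapper name `fetchTextOrNull` and its `fetcher` parameter are hypothetical; the parameter stands in for UrlFetchApp so the logic can be run outside Apps Script.

```javascript
// Hypothetical wrapper: `fetcher` is any object with a fetch(url, options) method.
// In Apps Script you would pass UrlFetchApp itself.
function fetchTextOrNull(fetcher, url) {
  var options = {
    "method": "get",
    "followRedirects": true,
    "muteHttpExceptions": true // failed requests return a response instead of throwing
  };
  var response = fetcher.fetch(url, options);
  // Only treat a 200 response as success; anything else yields null.
  if (response.getResponseCode() !== 200) {
    return null;
  }
  return response.getContentText();
}
```

In Apps Script this would be called as `fetchTextOrNull(UrlFetchApp, url)`; a null result indicates that the server answered with a status other than 200.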

Eugene answered Oct 20 '22 09:10