Scrape ajax web page with python and/or scrapy

What I want to do is scrape petition data - name, city, state, date, signature number - from one or more petitions at petitions.whitehouse.gov

I assume at this point that Python is the way to go - probably the Scrapy library - along with some functions to deal with the Ajax aspects of the site. The reason for this scraper is that this petition data is not available to the public.

I am a freelance tech journalist and I want to be able to dump each petition's data into a CSV file in order to analyze the number of people from each state who sign a state's petition and, with data from multiple petitions, find the number of people who sign multiple petitions, etc., and then draw some conclusions about the political viability of the petition process and the data itself.

The petition functions at petitions.whitehouse.gov run as a Drupal module, and the White House developers responded to my issue on GitHub (https://github.com/WhiteHouse/petition/issues/44) that they are working on an API to allow access to petition data from the module. But there is no release date for that API, and that doesn't solve the problem of the petition data currently on petitions.whitehouse.gov.

I've emailed the White House and the White House developers, stating that I am a freelance journalist and asking for some way to access the data. The White House Office of Digital Strategy told me that "Unfortunately, we don't have the means to provide data exports at this time, but we are working to open up the data going forward via the API." There is an "Open Data" initiative at the White House, but apparently petition data is not covered.

Privacy and TOS: There is little expectation of privacy in signing a petition, and there is no clear TOS that addresses web scraping of this data.

What has been done: Some faculty at UNC have written (what I assume is) a Python script to scrape the data, but they don't want to release the script to me, saying they are still working on it (http://www.unc.edu/~ncaren/secessionists/). They did send me a CSV data dump of one petition I am particularly interested in.

What I've done: I've set up a GitHub project for this, because I want any petition data scraper to be useful for everyone - petitioners themselves, journalists, etc. - who wants to be able to get this data: https://github.com/markratledge/whitehousescraper

I have no experience with python and little experience with shell scripting, and what I'm trying to do is obviously beyond my experience at this point.

I ran a GUI script to send a "spacebar" to the web browser every five seconds or so, and in that way scraped ~10,000 signatures by cutting and pasting the browser text into a text editor. From there, I could process the text with grep and awk into a CSV format. This, of course, doesn't work too well; Chrome bogged down with the size of the page, and it took hours to get that many signatures.

What I've found so far: from what I can gather from other SO questions and answers, it looks like Python and Scrapy (http://scrapy.org) are the way to go to avoid problems with browsers. But the page uses an Ajax function to load the next set of signatures. It appears to be a "static" Ajax request, because the URL doesn't change.

In Firebug, the JSON request URLs appear to have a random string appended to them, with a page number just before it. Does this say anything about what needs to be done? Does a script need to emulate these requests and send them to the webserver?

Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/2/50b32771ee140f072e000001
Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/3/50b1040f6ce61c837e000006
Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/4/50afb3d7c988d47504000004
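
If those three URLs are the whole story, one route is to skip the browser entirely and replay them in a loop. Below is a minimal, untested sketch using the requests library; the petition id and seed signature id are copied from the URLs above, and the idea that the endpoint returns JSON with a markup field (like the petitions loader further down) is an assumption, not something I've confirmed:

import re
import requests

BASE = "https://petitions.whitehouse.gov/signatures/more"
PETITION_ID = "50ab2aa8eab72abc4a000020"  # from the request URLs above
last_sig = "50b32771ee140f072e000001"     # seed value, also from above

session = requests.Session()
page = 2
while True:
    resp = session.get("%s/%s/%d/%s" % (BASE, PETITION_ID, page, last_sig))
    resp.raise_for_status()
    # Assumption: the response is JSON with a "markup" field holding the
    # next batch of rendered signatures.
    markup = resp.json().get("markup", "")
    if not markup.strip():
        break  # no more batches
    # Assumption: each batch contains 24-hex-digit ids like the ones in
    # the URLs; the last one seeds the next request.
    ids = re.findall(r"[0-9a-f]{24}", markup)
    if not ids:
        break
    last_sig = ids[-1]
    page += 1
    # ... parse `markup` for names/cities/states here ...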

This is the JS function that loads the signatures on the page:

(function ($) {
Drupal.behaviors.morePetitions = {
  attach: function(context) {
    $('.petition-list .show-more-petitions-bar').unbind();
    $(".petition-list .show-more-petitions-bar").bind('click',
      function () {
        $('.show-more-petitions-bar').addClass('display-none');
        $('.loading-more-petitions-bar').removeClass('display-none');

        var petition_sort = retrieveSort();
        var petition_cols = retrieveCols();
        var petition_issues = retrieveIssues();
        var petition_search = retrieveSearch();
        var petition_page = parseInt($('#page-num').html());

        var url = "/petitions/more/"+petition_sort+"/"+(petition_page + 1)+"/"+petition_cols+"/"+petition_issues+"/"+petition_search+"/";
        var params = {};
        $.getJSON(url, params, function(data) {
          $('#petition-bars').remove();
          $('.loading-more-petitions-bar').addClass('display-none');
          $('.show-more-petitions-bar').removeClass('display-none');
          $(".petition-list .petitions").append(data.markup).show();

          if (typeof wh_petition_adjustHeight == 'function') {
            wh_petition_adjustHeight();
          }

          Drupal.attachBehaviors('.petition-list .show-more-petitions-bar');
          if (typeof wh_petition_page_update_links == 'function') {
            wh_petition_page_update_links();
          }
        });

        return false;
      }
    );
  }
};
})(jQuery);

and that handler fires when this element is revealed by scrolling to the bottom of the browser window:

<a href="/petition/.../l76dWhwN?page=2&amp;last=50b3d98e7043012b24000011" class="load-next no-follow active" rel="509ec31cadfd958d58000005">Load Next 20 Signatures</a>
<div id="last-signature-id" class="display-none">50b3d98e7043012b24000011</div>
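
Whatever does the fetching, the markup that comes back still has to be turned into CSV rows. Here is a sketch with BeautifulSoup; every selector in it (.signature, .name, and so on) is a placeholder to be replaced with the real class names from the page source, and all_markup_batches stands in for the batches collected by a fetch loop like the one above:

import csv
from bs4 import BeautifulSoup

def markup_to_rows(markup):
    """Yield one row per signature from a batch of returned markup."""
    soup = BeautifulSoup(markup, "html.parser")
    # Placeholder selectors: inspect the real markup in Firebug and
    # substitute the actual class names.
    for sig in soup.select(".signature"):
        cells = [sig.select_one(sel) for sel in
                 (".name", ".city", ".state", ".date", ".number")]
        yield [c.get_text(strip=True) if c else "" for c in cells]

with open("petition.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city", "state", "date", "number"])
    for batch in all_markup_batches:  # collected by the fetch loop above
        for row in markup_to_rows(batch):
            writer.writerow(row)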

So, what's the best way to do this? Where do I go with Scrapy? Or is there another Python library better suited for this?
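
For the Scrapy route specifically, here is a rough, untested sketch of a spider walking the same /signatures/more/ endpoint with the current scrapy.Spider API; the URL pattern, seed values, and JSON "markup" field are the same assumptions as above:

import json
import re
import scrapy

class SignaturesSpider(scrapy.Spider):
    name = "signatures"
    petition_id = "50ab2aa8eab72abc4a000020"  # from the request URLs above

    def start_requests(self):
        # Seed page number and last-signature id; in practice these
        # would be read from the petition page itself.
        url = ("https://petitions.whitehouse.gov/signatures/more/%s/2/"
               "50b32771ee140f072e000001" % self.petition_id)
        yield scrapy.Request(url, callback=self.parse_batch)

    def parse_batch(self, response):
        markup = json.loads(response.text).get("markup", "")
        # ... yield items parsed out of `markup` here ...
        ids = re.findall(r"[0-9a-f]{24}", markup)
        if ids:
            page = int(response.url.rstrip("/").split("/")[-2]) + 1
            yield scrapy.Request(
                "https://petitions.whitehouse.gov/signatures/more/%s/%d/%s"
                % (self.petition_id, page, ids[-1]),
                callback=self.parse_batch)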

Feel free to comment, point me in a direction with code snippets or other SO questions/answers, or contribute on GitHub.

asked Nov 27 '12 by markratledge



1 Answer

The 'random link' looks like it has the form:

https://petitions.whitehouse.gov/signatures/more/<petitionid>/<pagenum>/<lastpetition>

where petitionid is static for a single petition, pagenum increments each time, and lastpetition is returned from each request.

My usual approach would be to use the requests library to emulate a session for cookies and then work out what requests the browser is making.

import requests

# Use a session so that any cookies persist across requests.
s = requests.Session()

# Toy example against httpbin; swap in the petition URLs once the
# request pattern is worked out.
url = 'http://httpbin.org/get'
params = {'cat': 'Persian',
          'age': 3,
          'name': 'Furball'}
s.get(url, params=params)

I'd pay particular attention to the following link:

<a href="/petition/shut-down-tar-sands-project-utah-it-begins-and-reject-keystone-xl-pipeline/H1MQJGMW?page=2&amp;last=50b5a1f9ee140f227a00000b" class="load-next no-follow active" rel="50ae9207eab72aed25000003">Load Next 20 Signatures</a>
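
That href carries the two values the endpoint needs (page and last), so a scraper can bootstrap itself from the link. A quick sketch of pulling them out with the standard library:

try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

href = ("/petition/shut-down-tar-sands-project-utah-it-begins-and-reject-"
        "keystone-xl-pipeline/H1MQJGMW?page=2&last=50b5a1f9ee140f227a00000b")

qs = parse_qs(urlparse(href).query)
page = int(qs["page"][0])  # 2
last = qs["last"][0]       # '50b5a1f9ee140f227a00000b'
# Feed these into the /signatures/more/ URL pattern described above.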

answered Oct 23 '22 by Dragon