Scrape ajax web page with python and/or scrapy

What I want to do is scrape petition data - name, city, state, date, signature number - from one or more petitions at petitions.whitehouse.gov

I assume at this point that Python is the way to go - probably the Scrapy library - along with some functions to deal with the Ajax aspects of the site. The reason for this scraper is that this petition data is not available to the public.

I am a freelance tech journalist and I want to be able to dump each petition's data into a CSV file in order to analyze the number of people from each state who sign a state's petition and, with data from multiple petitions, find the number of people who sign multiple petitions, etc., and then draw some conclusions about the political viability of the petition process and the data itself.

The petition functions at petitions.whitehouse.gov run as a Drupal module, and the White House developers responded to my issue on GitHub (https://github.com/WhiteHouse/petition/issues/44) that they are working on an API to allow access to petition data from the module. But there is no release date for that API, and that doesn't solve the problem of the petition data currently on petitions.whitehouse.gov.

I've emailed the White House and the White House developers, stating that I am a freelance journalist and asking for some way to access the data. The White House Office of Digital Strategy told me that "Unfortunately, we don't have the means to provide data exports at this time, but we are working to open up the data going forward via the API." There is an "Open Data" initiative at the White House, but apparently petition data is not covered.

Privacy and TOS: There is little expectation of privacy in signing a petition, and there is no clear TOS that addresses web scraping of this data.

What has been done: Some faculty at UNC have written (what I assume is) a Python script to scrape the data, but they don't want to release the script to me, saying they are still working on it (http://www.unc.edu/~ncaren/secessionists/). They did send me a CSV data dump of one petition I am particularly interested in.

What I've done: I've set up a GitHub project for this, because I want any petition data scraper to be useful for everyone - petitioners themselves, journalists, etc. - who wants to be able to get this data: https://github.com/markratledge/whitehousescraper

I have no experience with python and little experience with shell scripting, and what I'm trying to do is obviously beyond my experience at this point.

I ran a GUI script to send a "spacebar" to the web browser every five seconds or so, and in that way scraped ~10,000 signatures by cutting and pasting the browser text into a text editor. From there, I could process the text with grep and awk into a CSV format. This, of course, doesn't work too well; Chrome bogged down with the size of the page, and it took hours to get that many signatures.

What I've found so far: from what I can gather from other SO questions and answers, it looks like Python and Scrapy (http://scrapy.org) are the way to go to avoid problems with browsers. But the page uses an Ajax function to load the next set of signatures. It appears to be a "static" Ajax request, because the URL doesn't change.

In Firebug, the JSON request URLs appear to have a random string appended to them, with a page number just before it. Does this say anything about what needs to be done? Does a script need to emulate these requests and send them to the webserver?

Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/2/50b32771ee140f072e000001
Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/3/50b1040f6ce61c837e000006
Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/4/50afb3d7c988d47504000004
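
If those three URLs are the whole story, one route is to skip the browser entirely and replay them in a loop. Below is a minimal, untested sketch using the requests library; the petition id and seed signature id are copied from the URLs above, and the idea that the endpoint returns JSON with a markup field (like the petitions loader further down) is an assumption, not something I've confirmed:

import re
import requests

BASE = "https://petitions.whitehouse.gov/signatures/more"
PETITION_ID = "50ab2aa8eab72abc4a000020"  # from the request URLs above
last_sig = "50b32771ee140f072e000001"     # seed value, also from above

session = requests.Session()
page = 2
while True:
    resp = session.get("%s/%s/%d/%s" % (BASE, PETITION_ID, page, last_sig))
    resp.raise_for_status()
    # Assumption: the response is JSON with a "markup" field holding the
    # next batch of rendered signatures.
    markup = resp.json().get("markup", "")
    if not markup.strip():
        break  # no more batches
    # Assumption: each batch contains 24-hex-digit ids like the ones in
    # the URLs; the last one seeds the next request.
    ids = re.findall(r"[0-9a-f]{24}", markup)
    if not ids:
        break
    last_sig = ids[-1]
    page += 1
    # ... parse `markup` for names/cities/states here ...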

This is the JS function that loads the signatures on the page:

(function ($) {
Drupal.behaviors.morePetitions = {
  attach: function(context) {
    $('.petition-list .show-more-petitions-bar').unbind();
    $(".petition-list .show-more-petitions-bar").bind('click',
      function () {
        $('.show-more-petitions-bar').addClass('display-none');
        $('.loading-more-petitions-bar').removeClass('display-none');

        var petition_sort = retrieveSort();
        var petition_cols = retrieveCols();
        var petition_issues = retrieveIssues();
        var petition_search = retrieveSearch();
        var petition_page = parseInt($('#page-num').html());

        var url = "/petitions/more/"+petition_sort+"/"+(petition_page + 1)+"/"+petition_cols+"/"+petition_issues+"/"+petition_search+"/";
        var params = {};
        $.getJSON(url, params, function(data) {
          $('#petition-bars').remove();
          $('.loading-more-petitions-bar').addClass('display-none');
          $('.show-more-petitions-bar').removeClass('display-none');
          $(".petition-list .petitions").append(data.markup).show();

          if (typeof wh_petition_adjustHeight == 'function') {
            wh_petition_adjustHeight();
          }

          Drupal.attachBehaviors('.petition-list .show-more-petitions-bar');
          if (typeof wh_petition_page_update_links == 'function') {
            wh_petition_page_update_links();
          }
        });

        return false;
      }
    );
  }
};
})(jQuery);

and that handler fires when this element is revealed by scrolling to the bottom of the browser window:

<a href="/petition/.../l76dWhwN?page=2&amp;last=50b3d98e7043012b24000011" class="load-next no-follow active" rel="509ec31cadfd958d58000005">Load Next 20 Signatures</a>
<div id="last-signature-id" class="display-none">50b3d98e7043012b24000011</div>
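
Whatever does the fetching, the markup that comes back still has to be turned into CSV rows. Here is a sketch with BeautifulSoup; every selector in it (.signature, .name, and so on) is a placeholder to be replaced with the real class names from the page source, and all_markup_batches stands in for the batches collected by a fetch loop like the one above:

import csv
from bs4 import BeautifulSoup

def markup_to_rows(markup):
    """Yield one row per signature from a batch of returned markup."""
    soup = BeautifulSoup(markup, "html.parser")
    # Placeholder selectors: inspect the real markup in Firebug and
    # substitute the actual class names.
    for sig in soup.select(".signature"):
        cells = [sig.select_one(sel) for sel in
                 (".name", ".city", ".state", ".date", ".number")]
        yield [c.get_text(strip=True) if c else "" for c in cells]

with open("petition.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city", "state", "date", "number"])
    for batch in all_markup_batches:  # collected by the fetch loop above
        for row in markup_to_rows(batch):
            writer.writerow(row)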

So, what's the best way to do this? Where do I go with Scrapy? Or is there another Python library better suited for this?
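
For the Scrapy route specifically, here is a rough, untested sketch of a spider walking the same /signatures/more/ endpoint with the current scrapy.Spider API; the URL pattern, seed values, and JSON "markup" field are the same assumptions as above:

import json
import re
import scrapy

class SignaturesSpider(scrapy.Spider):
    name = "signatures"
    petition_id = "50ab2aa8eab72abc4a000020"  # from the request URLs above

    def start_requests(self):
        # Seed page number and last-signature id; in practice these
        # would be read from the petition page itself.
        url = ("https://petitions.whitehouse.gov/signatures/more/%s/2/"
               "50b32771ee140f072e000001" % self.petition_id)
        yield scrapy.Request(url, callback=self.parse_batch)

    def parse_batch(self, response):
        markup = json.loads(response.text).get("markup", "")
        # ... yield items parsed out of `markup` here ...
        ids = re.findall(r"[0-9a-f]{24}", markup)
        if ids:
            page = int(response.url.rstrip("/").split("/")[-2]) + 1
            yield scrapy.Request(
                "https://petitions.whitehouse.gov/signatures/more/%s/%d/%s"
                % (self.petition_id, page, ids[-1]),
                callback=self.parse_batch)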

Feel free to comment, point me in a direction with code snippets or other SO questions/answers, or contribute on GitHub.

asked Nov 27 '12 by markratledge



1 Answer

The 'random link' looks like it has the form:

https://petitions.whitehouse.gov/signatures/more/<petitionid>/<pagenum>/<lastpetition>

where petitionid is static for a single petition, pagenum increments each time, and lastpetition is returned from each request.

My usual approach would be to use the requests library to emulate a session for cookies and then work out what requests the browser is making.

import requests

# Use a session so that any cookies persist across requests.
s = requests.Session()

# Toy example against httpbin; swap in the petition URLs once the
# request pattern is worked out.
url = 'http://httpbin.org/get'
params = {'cat': 'Persian',
          'age': 3,
          'name': 'Furball'}
s.get(url, params=params)

I'd pay particular attention to the following link:

<a href="/petition/shut-down-tar-sands-project-utah-it-begins-and-reject-keystone-xl-pipeline/H1MQJGMW?page=2&amp;last=50b5a1f9ee140f227a00000b" class="load-next no-follow active" rel="50ae9207eab72aed25000003">Load Next 20 Signatures</a>
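
That href carries the two values the endpoint needs (page and last), so a scraper can bootstrap itself from the link. A quick sketch of pulling them out with the standard library:

try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

href = ("/petition/shut-down-tar-sands-project-utah-it-begins-and-reject-"
        "keystone-xl-pipeline/H1MQJGMW?page=2&last=50b5a1f9ee140f227a00000b")

qs = parse_qs(urlparse(href).query)
page = int(qs["page"][0])  # 2
last = qs["last"][0]       # '50b5a1f9ee140f227a00000b'
# Feed these into the /signatures/more/ URL pattern described above.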

answered Oct 23 '22 by Dragon