
Dissecting the script Aaron Swartz used to download several thousand articles from JSTOR's archive

Aaron Swartz played an important role in shaping the internet during its earlier years. For those familiar with Aaron, you likely know he committed suicide after facing up to 35 years in prison for downloading a massive number of articles from JSTOR's archive, a digital library of academic journals and books. The script he used to download the articles was released and is pictured below. (Here's a link to Aaron's documentary for those interested.)

keepgrabbing.py

This is the code:

import subprocess, urllib, random
class NoBlocks(Exception): pass
def getblocks():
    r = urllib.urlopen("http://{?REDACTED?}/grab").read()
    if '<html' in r.lower(): raise NoBlocks
    return r.split()


import sys
if len(sys.argv) > 1:
    prefix = ['--socks5', sys.argv[1]]
else:
    prefix = []#'-interface','eth0:1']
line = lambda x: ['curl'] + prefix + ['-H', "Cookie: TENACIOUS=" + str(random.random())[3:], '-o', 'pdfs/' + str(x) + '.pdf', "http://www.jstor.org/stable/pdfplus/" + str(x) + ".pdf?acceptTC=true"]


while 1:
    blocks = getblocks()
    for block in blocks:
        print block
        subprocess.Popen(line(block)).wait()

I guess this is a tribute of sorts. I have always been extremely affected by Aaron's story and his passing. He was a brilliant pioneer of the internet, helping to build Creative Commons, the RSS web feed format, and Reddit, all before taking his own life at 26 years old.

I want to understand as much as I can about the event that led to the death of a man who did many great things for the internet and its growing community of users.

Context

JSTOR is a large library of academic publications. When Aaron downloaded articles from its archive in 2010, JSTOR was freely available to MIT students, but not freely available to the public. While we don't know exactly what Aaron wanted to do with the information, it's a safe bet he wanted to spread it to those without access.

My Question

I see he created a function getblocks() that used the urllib module to access JSTOR's digital archives, read the HTML of the web pages into a variable, and split the contents of the page.

It's the command line section of the code that I don't understand, from after he imported the sys module through the end of the if/else statement.

He created a command line argument that allowed him to define... what? What was he doing here?

If no command line argument was passed and the else branch was taken, what was his lambda function accomplishing here?

if len(sys.argv) > 1:
    prefix = ['--socks5', sys.argv[1]]
else:
    prefix = []#'-interface','eth0:1']
line = lambda x: ['curl'] + prefix + ['-H', "Cookie: TENACIOUS=" + str(random.random())[3:], '-o', 'pdfs/' + str(x) + '.pdf', "http://www.jstor.org/stable/pdfplus/" + str(x) + ".pdf?acceptTC=true"]

Any insight into the mechanics of Aaron's large file siege would be greatly appreciated.

Rest easy, Aaron.

Additional Notes

The legal documents related to the case can be found here. In these documents, there's a link to several conversations among JSTOR's employees after Aaron downloaded all the documents. In one email exchange, a JSTOR employee describes how Aaron circumvented the "sessions by IP" rule to download.

"By clearing their cookies and starting new session they effectively dodge the abuse tools in Literatum.... The # of sessions per IP rule did not fire because it is on a server by server basis and the user was load balanced across more than few servers. 8500 sessions would only need two servers to dodge the rule. We can ratchet the of sessions down but am requesting data to find an effective level that would have caught incident without disrupting normal users elsewhere With our MDC and number of servers there may be no sweet spot that accomplishes both."

Patrick Harris asked Apr 07 '18 19:04



1 Answer

Your transcription of the code image is missing the last line, which is pretty key. Inside the loop, it's calling subprocess.Popen on the output of the line lambda function:

subprocess.Popen(line(block)).wait()

The getblocks function reads from a redacted website (probably not jstor) to get a list of pdf files to download. This allows the script to be remotely controlled.
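To make that step concrete, here is a minimal Python 3 sketch of just the parsing half of getblocks, with the network call left out. The name parse_blocks is hypothetical, and the response format (whitespace-separated IDs, with an HTML page meaning "nothing to hand out") is an assumption inferred from the `'<html' in r.lower()` check and the `r.split()` call in the original script.

```python
class NoBlocks(Exception):
    """Raised when the server answers with an HTML page instead of a list of IDs."""


def parse_blocks(body):
    # Hypothetical helper: mirrors the two lines of getblocks() that follow
    # the urlopen(...).read() call in the original script.
    if '<html' in body.lower():
        # An HTML page (for example an error page) is treated as "no work available".
        raise NoBlocks
    # Otherwise the body is assumed to be whitespace-separated article IDs.
    return body.split()


# Made-up sample input; the real server's output is not shown in the question.
print(parse_blocks("1234567 1234568\n1234569\n"))
```

Each ID returned this way is what the while loop passes to the line lambda as `block`.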

The line lambda function produces a list of command line arguments that will be used by Popen to call the curl command line utility program, which does the actual downloading. If the script was started with a command line argument, that argument is passed to curl as a SOCKS5 proxy address via --socks5; otherwise the prefix list is empty and curl connects directly. The cookie mentioned in the quotation in your "additional notes" section gets produced in the lambda (it generates a random number, converts it to a string, and drops the first three characters to get the cookie value).
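As a rough illustration, the lambda can be rewritten as an ordinary Python 3 function that only builds the argument list and never runs curl. The function name build_curl_args, the proxy parameter (standing in for sys.argv[1]), and the sample ID and proxy address are all hypothetical; the strings themselves are copied from the script in the question.

```python
import random


def build_curl_args(block_id, proxy=None):
    # Same role as the if/else around `prefix` in the original script:
    # route curl through a SOCKS5 proxy only when one was supplied.
    prefix = ['--socks5', proxy] if proxy else []
    # str(random.random()) looks like '0.123456...'; [3:] drops the first
    # three characters and keeps the remaining digits as the cookie value.
    cookie = "Cookie: TENACIOUS=" + str(random.random())[3:]
    return (['curl'] + prefix +
            ['-H', cookie,
             '-o', 'pdfs/' + str(block_id) + '.pdf',
             "http://www.jstor.org/stable/pdfplus/" + str(block_id) + ".pdf?acceptTC=true"])


# The slice on a fixed value, to show what gets dropped:
print(str(0.123456)[3:])  # prints 23456

# Made-up ID and proxy address, purely for illustration:
print(build_curl_args(1234567))
print(build_curl_args(1234567, proxy='localhost:1080'))
```

Because a fresh random value is generated on every call, each curl invocation sends a different cookie, which matches the "starting new session" behavior described in the quoted email.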

Blckknght answered Nov 15 '22 08:11
