
Scrape a web page that requires a session cookie first

I'm trying to scrape an Excel file from a government "muster roll" database. However, the URL I have to use to access this Excel file:

http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal

requires that I have a session cookie from the government site attached to the request.

How could I grab the session cookie with an initial request to the landing page (when they give you the session cookie) and then use it to hit the URL above to grab the Excel file? I'm on Google App Engine using Python.

I tried this:

import urllib2
import cookielib

url = 'http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal'


def grab_data_with_cookie(cookie_jar, url):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    data = opener.open(url)
    return data

cj = cookielib.CookieJar()

#grab the data 
data1 = grab_data_with_cookie(cj, url)
#the second time we do this, we get back the excel sheet.
data2 = grab_data_with_cookie(cj, url)

stuff2  = data2.read()

I'm pretty sure this isn't the best way to do this. How could I do this more cleanly, or even using the requests library?
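For reference, the same attempt written for Python 3, where `urllib2` and `cookielib` were merged into `urllib.request` and `http.cookiejar`, would look roughly like this sketch:

```python
import urllib.request
import http.cookiejar

url = ('http://nrega.ap.gov.in/Nregs/FrontServlet?'
       'requestType=HouseholdInf_engRH&hhid=192420317026010002'
       '&actionVal=musterrolls&type=Normal')

# Build one opener around a shared cookie jar so the session cookie
# set by the first response is replayed on every later request.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))

def grab_data_with_cookie(opener, url):
    # The opener sends any cookies already in the jar and stores new ones.
    return opener.open(url)
```

As in the original snippet, the first call primes the jar with the session cookie and the second call should return the spreadsheet.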

Asked Mar 17 '12 by rd108


1 Answer

Using requests this is a trivial task:

>>> import requests
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)

>>> print r.cookies
{'requests-is': 'awesome'}
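Applied to the muster-roll case, the key is to reuse one `requests.Session`, which keeps a single cookie jar across requests. A minimal sketch (the landing-page URL here is an assumption; point it at whatever page actually hands out the session cookie):

```python
import requests

# Assumed landing page that sets the session cookie; adjust as needed.
LANDING_URL = 'http://nrega.ap.gov.in/Nregs/FrontServlet'
EXCEL_URL = ('http://nrega.ap.gov.in/Nregs/FrontServlet?'
             'requestType=HouseholdInf_engRH&hhid=192420317026010002'
             '&actionVal=musterrolls&type=Normal')

def fetch_excel(landing_url=LANDING_URL, excel_url=EXCEL_URL):
    # A Session persists cookies across requests, so the session cookie
    # set by the first response is sent automatically on the second.
    session = requests.Session()
    session.get(landing_url)       # server sets the session cookie here
    return session.get(excel_url)  # cookie rides along with this request
```

`fetch_excel().content` can then be written to a `.xls` file.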
Answered Sep 22 '22 by Burhan Khalid