
Scrape a web page that requires a session cookie first

I'm trying to scrape an Excel file from a government "muster roll" database. However, the URL I have to use to access this Excel file:

http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal

requires that I have a session cookie from the government site attached to the request.

How could I grab the session cookie with an initial request to the landing page (when they give you the session cookie) and then use it to hit the URL above to grab the Excel file? I'm on Google App Engine using Python.

I tried this:

import urllib2
import cookielib

url = 'http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal'


def grab_data_with_cookie(cookie_jar, url):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    data = opener.open(url)
    return data

cj = cookielib.CookieJar()

#grab the data 
data1 = grab_data_with_cookie(cj, url)
#the second time we do this, we get back the excel sheet.
data2 = grab_data_with_cookie(cj, url)

stuff2  = data2.read()

I'm pretty sure this isn't the best way to do this. How could I do this more cleanly, or even using the requests library?
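For reference, the same attempt written for Python 3, where `urllib2` and `cookielib` were merged into `urllib.request` and `http.cookiejar`, would look roughly like this sketch:

```python
import urllib.request
import http.cookiejar

url = ('http://nrega.ap.gov.in/Nregs/FrontServlet?'
       'requestType=HouseholdInf_engRH&hhid=192420317026010002'
       '&actionVal=musterrolls&type=Normal')

# Build one opener around a shared cookie jar so the session cookie
# set by the first response is replayed on every later request.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))

def grab_data_with_cookie(opener, url):
    # The opener sends any cookies already in the jar and stores new ones.
    return opener.open(url)
```

As in the original snippet, the first call primes the jar with the session cookie and the second call should return the spreadsheet.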

Asked Mar 17 '12 by rd108


1 Answer

Using requests this is a trivial task:

>>> import requests
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)

>>> print r.cookies
{'requests-is': 'awesome'}
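Applied to the muster-roll case, the key is to reuse one `requests.Session`, which keeps a single cookie jar across requests. A minimal sketch (the landing-page URL here is an assumption; point it at whatever page actually hands out the session cookie):

```python
import requests

# Assumed landing page that sets the session cookie; adjust as needed.
LANDING_URL = 'http://nrega.ap.gov.in/Nregs/FrontServlet'
EXCEL_URL = ('http://nrega.ap.gov.in/Nregs/FrontServlet?'
             'requestType=HouseholdInf_engRH&hhid=192420317026010002'
             '&actionVal=musterrolls&type=Normal')

def fetch_excel(landing_url=LANDING_URL, excel_url=EXCEL_URL):
    # A Session persists cookies across requests, so the session cookie
    # set by the first response is sent automatically on the second.
    session = requests.Session()
    session.get(landing_url)       # server sets the session cookie here
    return session.get(excel_url)  # cookie rides along with this request
```

`fetch_excel().content` can then be written to a `.xls` file.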
Answered Sep 22 '22 by Burhan Khalid