I'm trying to scrape an Excel file from a government "muster roll" database. However, the URL I use to access this Excel file:
http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal
requires that I have a session cookie from the government site attached to the request.
How could I grab the session cookie with an initial request to the landing page (where the site issues the session cookie) and then use it to request the URL above and download the Excel file? I'm on Google App Engine using Python.
I tried this:
import urllib2
import cookielib

url = 'http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal'

def grab_data_with_cookie(cookie_jar, url):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    data = opener.open(url)
    return data

cj = cookielib.CookieJar()

# grab the data
data1 = grab_data_with_cookie(cj, url)
# the second time we do this, we get back the Excel sheet.
data2 = grab_data_with_cookie(cj, url)
stuff2 = data2.read()
I'm pretty sure this isn't the best way to do this. How could I do this more cleanly, or even using the requests library?
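One way the urllib2 approach could be tidied up is to build the cookie-aware opener once and reuse it for both requests, rather than rebuilding it on every call. The sketch below is only an illustration under a few assumptions: make_cookie_opener and fetch_with_session_cookie are hypothetical helper names, the landing page address is left as a parameter because the question does not give it, and the try/except import is there only so the same code also runs on Python 3, where urllib2 and cookielib were renamed.

```python
try:
    # Python 2 names, as used in the question (App Engine's Python 2 runtime)
    import urllib2 as urlrequest
    import cookielib as cookiejar
except ImportError:
    # Python 3 renamed the same modules
    import urllib.request as urlrequest
    import http.cookiejar as cookiejar


def make_cookie_opener():
    """Build one opener that keeps cookies across every request it makes."""
    jar = cookiejar.CookieJar()
    return urlrequest.build_opener(urlrequest.HTTPCookieProcessor(jar))


def fetch_with_session_cookie(landing_url, file_url, opener=None):
    """Visit landing_url to pick up the session cookie, then download file_url.

    Both helper names are hypothetical; landing_url is whatever page the
    site uses to issue its session cookie.
    """
    if opener is None:
        opener = make_cookie_opener()
    # First request: the body is discarded; only the cookie it sets matters.
    opener.open(landing_url).read()
    # Second request: the opener sends the stored cookie back automatically.
    return opener.open(file_url).read()
```

Since the code in the question gets the cookie by requesting the same URL twice, passing that URL as both landing_url and file_url should reproduce its behavior.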
Scraping data that is publicly available on the internet is generally legal. However, some kinds of data are protected by law or regulation, so take care when scraping personal data, intellectual property, or confidential data.
Web cookies, also known as HTTP cookies or browser cookies, are small pieces of data that a server sends to a user's browser in the Set-Cookie HTTP response header so that it can identify the browser later. On subsequent requests, the browser sends the cookie back in the Cookie request header, which lets the server recognize the browser.
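To make that round trip concrete, here is a minimal sketch that turns a Set-Cookie value into the Cookie header a client would send back. The cookie name JSESSIONID and its value are invented for illustration (the site's actual cookie name is not stated anywhere above), and cookie_header_from_set_cookie is a hypothetical helper, not part of any library.

```python
try:
    from http.cookies import SimpleCookie  # Python 3
except ImportError:
    from Cookie import SimpleCookie  # Python 2 name for the same class


def cookie_header_from_set_cookie(set_cookie_value):
    """Return the Cookie request header value for a Set-Cookie response value.

    Attributes such as Path or HttpOnly only tell the client how to store
    the cookie; just the name=value pair is sent back to the server.
    """
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    pairs = ["{0}={1}".format(name, morsel.coded_value)
             for name, morsel in sorted(parsed.items())]
    return "; ".join(pairs)


# Hypothetical Set-Cookie value, as a server might send it.
print(cookie_header_from_set_cookie("JSESSIONID=abc123; Path=/Nregs; HttpOnly"))
```

A cookie jar (cookielib.CookieJar, or the one inside a requests session) does this bookkeeping automatically, including the domain and path matching that this sketch leaves out.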
How do I get cookie data in Python? On the server side, in a Flask application, use the make_response() function to get the response object from the return value of the view function, then store the cookie with the response object's set_cookie() method. Reading cookies back is just as easy: they are available through request.cookies.
Using requests this is a trivial task:
>>> import requests
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)
>>> print r.cookies
{'requests-is': 'awesome'}
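For the two-step download in the question, a requests Session is probably the cleanest fit, since it stores cookies from each response and sends them on later requests made through the same session. The following is a sketch rather than a tested solution for that site: download_with_session is a hypothetical helper, the landing page URL is left as a parameter because it is not given above, and requests is imported lazily only so that the helper can also be exercised with a stand-in session object.

```python
def download_with_session(landing_url, file_url, session=None):
    """Fetch landing_url to obtain the session cookie, then return file_url's bytes.

    Hypothetical helper: `session` defaults to a new requests.Session, whose
    cookie jar carries cookies from the first response into the second request.
    """
    if session is None:
        import requests  # third-party; imported lazily so a stub session can be passed in
        session = requests.Session()

    # First request: its only job is to make the server issue the session cookie.
    landing = session.get(landing_url)
    landing.raise_for_status()

    # Second request: the session attaches the stored cookie automatically.
    response = session.get(file_url)
    response.raise_for_status()

    # .content is the raw body as bytes, which is what a binary Excel file needs.
    return response.content
```

If the site hands out the cookie on the first hit of the file URL itself, as the urllib2 code in the question suggests, the same URL can be passed for both arguments.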