Right now, I can crawl regular pages using urllib2.
import urllib2, random

# agents is a list of User-Agent strings defined elsewhere
request = urllib2.Request('http://stackoverflow.com')
request.add_header('User-Agent', random.choice(agents))
response = urllib2.urlopen(request)
htmlSource = response.read()
print htmlSource
However, I would like to simulate a POST (or fake a session) so that I can log into Facebook and crawl it. How do I do that?
You'll need to keep the cookie your site of choice sends you when you log in; that's what keeps your session. With urllib2, you do this by creating an opener object that supports cookie processing:
import urllib2, cookielib

jar = cookielib.CookieJar()   # holds the session cookies across requests
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
With this opener, you can do requests, either GET or POST:
content = opener.open(urllib2.Request(
    "http://social.netwo.rk/login",
    "user=foo&pass=bar"
)).read()
Because a second argument (the request body) is passed to urllib2.Request, this becomes a POST request; if that argument is None (or omitted), you get a GET request instead. You can also add HTTP headers, either with .add_header or by handing the constructor a dictionary of headers. Read the documentation for urllib2.Request for more information.
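For example (a sketch using the placeholder URL and form fields from above, reusing the opener we built):

# POST: a body string is supplied, headers passed as a dict to the constructor
request = urllib2.Request(
    "http://social.netwo.rk/login",
    "user=foo&pass=bar",
    {"User-Agent": "Mozilla/5.0"}
)
content = opener.open(request).read()

# GET: no body, header added afterwards with .add_header
request = urllib2.Request("http://social.netwo.rk/profile")
request.add_header("User-Agent", "Mozilla/5.0")
content = opener.open(request).read()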
That should get you started! Good luck.
(PS: If you don't need read access to the cookies, you can omit creating the cookie jar yourself; HTTPCookieProcessor will create one for you.)
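That shorter form looks like this:

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())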
The Mechanize library is an easy way to emulate a browser in Python.
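A rough sketch of a Mechanize-based login (the URL and form field names are placeholders; Mechanize keeps cookies for you automatically):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                      # ignore robots.txt for this sketch
br.addheaders = [("User-Agent", "Mozilla/5.0")]
br.open("http://social.netwo.rk/login")          # placeholder login URL
br.select_form(nr=0)                             # assumes the login form is the first form on the page
br["user"] = "foo"                               # assumed form field names
br["pass"] = "bar"
response = br.submit()
html = response.read()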