
How to use Python to log into Facebook/Myspace and crawl the content?

Right now, I can crawl regular pages using urllib2.

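import random
import urllib2

# agents is assumed to be a list of User-Agent strings defined elsewhere
request = urllib2.Request('http://stackoverflow.com')
request.add_header('User-Agent', random.choice(agents))
response = urllib2.urlopen(request)
htmlSource = response.read()
print htmlSource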

However, I would like to simulate a POST (or fake a session) so that I can log into Facebook and crawl it. How do I do that?

TIMEX asked Dec 14 '22 02:12



2 Answers

You'll need to keep the cookie that the site sends you when you log in; that cookie is what keeps your session alive. With urllib2, you do this by building an opener object that supports cookie processing:

import urllib2, cookielib
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

With this opener, you can do requests, either GET or POST:

content = opener.open(urllib2.Request(
    "http://social.netwo.rk/login",
    "user=foo&pass=bar")
).read()

Because urllib2.Request is given a second argument (the request body), this is a POST request; if that argument is None, you get a GET request instead. You can also add HTTP headers, either with .add_header or by passing the constructor a dictionary of headers. See the urllib2.Request documentation for more information.
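For readers on Python 3, where urllib2 became urllib.request, the POST-versus-GET rule can be checked without touching the network. This is only a minimal sketch: the URL is the placeholder from the example above, and the form field names, the password value, and the User-Agent string are made up for illustration.

```python
import urllib.parse
import urllib.request

LOGIN_URL = "http://social.netwo.rk/login"  # placeholder URL, not a real site

# urlencode escapes special characters in the form values.
# The field names "user" and "pass" are assumptions; real sites differ.
form = urllib.parse.urlencode({"user": "foo", "pass": "b&r"}).encode("ascii")

# A request built with a data argument is sent as a POST.
post_request = urllib.request.Request(
    LOGIN_URL,
    data=form,
    headers={"User-Agent": "example-crawler/0.1"},  # headers passed as a dict
)

# A request built without data is sent as a GET.
get_request = urllib.request.Request(LOGIN_URL)
get_request.add_header("User-Agent", "example-crawler/0.1")

print(post_request.get_method())
print(get_request.get_method())
print(form)
```

Building the body with urlencode rather than by hand avoids broken requests when a username or password contains characters such as & or =.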

That should get you started! Good luck.

(ps: If you don't need read access to the cookies, you can just omit creating the cookie jar yourself; the HTTPCookieProcessor will do it for you.)
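To see the session cookie being carried from one request to the next, here is a rough Python 3 sketch that runs against a throwaway local server instead of a real site. Everything about the server (the DemoHandler class, the /login and /profile paths, the session=abc123 cookie) is invented for the demonstration; only the opener code corresponds to the approach described above, with no explicit cookie jar.

```python
import http.server
import threading
import urllib.request


class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site: POST sets a cookie, GET echoes it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the submitted form body
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        body = (self.headers.get("Cookie") or "no cookie").encode("ascii")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port.
server = http.server.HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_address[1]

# ProxyHandler({}) disables any system proxy so the demo reaches localhost.
no_proxy = urllib.request.ProxyHandler({})

# No explicit jar: HTTPCookieProcessor creates a CookieJar internally.
opener = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor()
)
opener.open(base + "/login", data=b"user=foo&pass=bar", timeout=5).read()
with_session = opener.open(base + "/profile", timeout=5).read().decode("ascii")

# An opener without cookie handling does not send the session back.
plain = urllib.request.build_opener(no_proxy)
without_session = plain.open(base + "/profile", timeout=5).read().decode("ascii")

server.shutdown()
server.server_close()
print(with_session)
print(without_session)
```

The first printed line should show the cookie echoed back by the server, and the second should show that the plain opener sent none. A real login flow usually involves more than this (hidden form fields, redirects, HTTPS), so treat it as an illustration of the cookie mechanism only.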

AKX answered Apr 20 '23 01:04



The Mechanize library is an easy way to emulate a browser in Python.

Walter answered Apr 20 '23 00:04
