Scraping Data from Facebook with Python

Tags: python, facebook, beautifulsoup, web-scraping, mechanize

I've been trying for several days now (unsuccessfully) to scrape cities from about 500 Facebook URLs. However, Facebook handles its data in a very strange way, and I can't figure out what's going on under the hood well enough to understand what I need to do.

Essentially, the problem is that Facebook displays very different amounts of data depending on who is logged in and what the privacy settings of the account are. For instance, try opening the following three links, once in a browser where you are logged into Facebook and once in a browser where you are not:

[REDACTED LINKS DUE TO PRIVACY CONCERNS]

As you can see, Facebook loads the data in both cases for the first link, but only returns data for the second link if you are logged in (to ANY account). The third link displays the city when you are logged in, but only displays other information when you are not.

The reason this is extremely problematic (and related to Python) is that when trying to scrape the page with Beautiful Soup or Mechanize, I cannot figure out how to get the program to "pretend" that I am logged into an account. This means that I can easily grab data from the first type of link (of which there are fewer than 10), but I cannot get the city from the second or third type. So far I've tried a number of solutions with little success.

Here's some sample code that works correctly for the first type, but not for other types:

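import re

import mechanize

fb_url = 'http://www.facebook.com/100004210542493'

# Ignore robots.txt so mechanize will fetch the profile page.
br = mechanize.Browser()
br.set_handle_robots(False)

br.open(fb_url)
# Decode the raw response so the regex below works on text.
all_html = br.response().get_data().decode('utf-8', 'replace')
print(all_html)

# The city sits between these two class-name fragments in the page markup.
match = re.search(r'fsl fwb fcb">(.+?)</a></div><div class="aboutSubtitle fsm fwn fcg', all_html)
# re.search returns None when the city is not in the HTML (e.g. not logged in).
city = match.group(1) if match else None

user_info = [fb_url, city]
print(user_info)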

I also have a version that uses Beautiful Soup. If anyone has any ideas on how to get around this, I would be extremely grateful. Thank you!

asked Sep 27 '13 02:09

cscanlin

2 Answers

The right way to do this is to use the Facebook API. For various business, security, and privacy reasons, they go out of their way to make scraping data tricky.
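As a rough sketch of the API route: the Graph API can return a user's location to an app that holds a suitable access token. The endpoint shape, the "location" field, and the response layout below are assumptions written from memory of the Graph API and should be checked against the current documentation. The names build_location_url and parse_city and the token value are hypothetical, for illustration only.

```python
import json
from urllib.parse import urlencode

GRAPH_ROOT = "https://graph.facebook.com"  # assumed base URL of the Graph API


def build_location_url(user_id, access_token):
    """Build a URL that asks only for the user's location field (assumed field name)."""
    query = urlencode({"fields": "location", "access_token": access_token})
    return "{}/{}?{}".format(GRAPH_ROOT, user_id, query)


def parse_city(response_text):
    """Return the location name from a JSON response body, or None if it is absent."""
    data = json.loads(response_text)
    # The field may be missing or null when privacy settings or permissions hide it.
    location = data.get("location") or {}
    return location.get("name")


# Fetching would then be something like:
#   body = urllib.request.urlopen(build_location_url(user_id, token)).read()
#   city = parse_city(body)
```

Keeping the URL building and the parsing separate from the network call makes it easy to try the parsing against saved responses before running it over all 500 profiles.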

If you insist on scraping, I would try to log in first by using mechanize to submit the login form. I've never tried to do this with Facebook, but a lot of websites have easier-to-parse versions intended for mobile users at m.site.com.
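A minimal sketch of the log-in-first idea, untested against Facebook itself. It assumes the login form is the first form on the page and that its fields are named "email" and "pass"; both are guesses that need checking against the real page. The helper names log_in and make_browser are hypothetical.

```python
def log_in(browser, login_url, email, password):
    """Submit the first form on login_url and return the body of the response."""
    browser.open(login_url)
    browser.select_form(nr=0)    # assumption: the login form is the first form
    browser["email"] = email     # assumption: field name used by the login form
    browser["pass"] = password   # assumption: field name used by the login form
    response = browser.submit()
    return response.read()


def make_browser():
    """Create a mechanize browser with robots.txt handling off, as in the question."""
    import mechanize  # imported here so log_in can be tried with a stand-in browser
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    return browser


# Possible usage (the login URL is an assumption):
#   br = make_browser()
#   log_in(br, "https://m.facebook.com/login.php", "me@example.com", "secret")
#   html = br.open(profile_url).read()
```

Pages opened afterwards with the same browser object should reuse the session cookies set at login, so the profile requests would be made as the logged-in account.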

answered Oct 23 '22 12:10

James Robinson

I think scraping data from Facebook is illegal. It is prohibited in Facebook's terms of use. Every activity is registered with your login details, even when you use a bot to scrape. If you are caught, they can ban you from using Facebook for life. If you pose a potential threat to any of their assets, they can penalize you further.

answered Oct 23 '22 13:10

TNT
