
Scraping data out of facebook using scrapy

The new Graph Search on Facebook lets you search for current employees of a company using a query such as "Current Google employees".

I want to scrape the results page (http://www.facebook.com/search/104958162837/employees/present) with Scrapy.

The initial problem was that Facebook only lets a logged-in user access this information, so I was redirected to login.php. To get around that, I log in via Scrapy first and then request the results page. But even though the HTTP response for that page is 200, it does not scrape any data. The code is as follows:

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector

# Results page to request after logging in (the URL given above).
query = 'http://www.facebook.com/search/104958162837/employees/present'

class DmozSpider(BaseSpider):
    name = "test"
    start_urls = ['https://www.facebook.com/login.php']
    task_urls = [query]

    def parse(self, response):
        # Fill in and submit the login form found on login.php.
        return [FormRequest.from_response(
            response, formname='login_form',
            formdata={'email': 'myemailid', 'pass': 'myfbpassword'},
            callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # Logged in: now request the results page.
        return Request(query, callback=self.page_parse)

    def page_parse(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs
        items = hxs.select('//div[@class="_4_yl"]')
        print items

What could I have missed or done incorrectly?

Aryabhatt asked May 31 '13 18:05



1 Answer

The problem is that the search results (specifically the div initial_browse_result) are loaded dynamically via JavaScript. Scrapy receives the page before that happens, so the results are not in the response yet.
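One way to confirm this is to inspect the HTML Scrapy actually received and check whether the results container has anything inside it. Below is a minimal Python 3 sketch using only the standard library. The two HTML snippets are made up for illustration and are not Facebook's real markup; only the id initial_browse_result and the class _4_yl come from this thread, and the helper names are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContainerProbe(HTMLParser):
    """Counts the elements nested inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # 0 means we are outside the target element
        self.found = False   # True once the target element was seen
        self.children = 0    # number of elements nested inside the target

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self.depth:
                self.children += 1
            return
        if self.depth:
            self.children += 1
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def probe(html, target_id="initial_browse_result"):
    """Return (container_found, nested_element_count) for the given HTML."""
    parser = ContainerProbe(target_id)
    parser.feed(html)
    return parser.found, parser.children


# Illustrative stand-in for what a plain HTTP client receives:
# the container exists, but JavaScript has not filled it yet.
RAW_HTML = """
<html><body>
  <div id="initial_browse_result"></div>
  <script>/* results would be injected here after load */</script>
</body></html>
"""

# Illustrative stand-in for the same page after the browser ran the scripts.
RENDERED_HTML = """
<html><body>
  <div id="initial_browse_result">
    <div class="_4_yl">first result</div>
    <div class="_4_yl">second result</div>
  </div>
</body></html>
"""

if __name__ == "__main__":
    print(probe(RAW_HTML))
    print(probe(RENDERED_HTML))
```

If the container is found but has no nested elements in the response Scrapy saw, the XPath is not the problem; the data simply was not in the page yet.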

Basically, you have two options here:

  • try to simulate these JavaScript (XHR) requests in Scrapy, see:

    • Scraping ajax pages using python
    • Can scrapy be used to scrape dynamic content from websites that are using AJAX?
  • use a combination of Scrapy and Selenium, or Scrapy and mechanize, to load the whole page with the content, see:

    • Executing Javascript Submit form functions using scrapy in python
    • this answer

If you go with the first option, you should analyze all the requests made during the page load and figure out which one is responsible for fetching the data you want to scrape.
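As a rough illustration of the first option, here is a Python 3 sketch of what replaying such a request could look like once you have found it in the browser's network panel. Everything specific in it is an assumption: the endpoint URL, the parameter names page_id and cursor, and the JSON shape with a "results" list are placeholders, not Facebook's actual API.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; replace it with the request you identify
# in the browser's network panel while the results page loads.
XHR_ENDPOINT = "https://example.com/ajax/search_results"


def build_xhr_url(page_id, cursor=None):
    """Build the URL of the (assumed) XHR request that returns results."""
    params = {"page_id": page_id}   # hypothetical parameter name
    if cursor:
        params["cursor"] = cursor   # hypothetical paging parameter
    return XHR_ENDPOINT + "?" + urlencode(params)


def extract_names(payload_text):
    """Pull result names out of an assumed JSON payload shape."""
    data = json.loads(payload_text)
    return [row["name"] for row in data.get("results", [])]


if __name__ == "__main__":
    print(build_xhr_url("104958162837"))
    sample = '{"results": [{"name": "Alice"}, {"name": "Bob"}]}'
    print(extract_names(sample))
```

In a spider, the URL built this way would be passed to a Request after login, and the callback would parse response.body as JSON instead of running XPath over HTML.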

The second option is pretty straightforward and will definitely work: you use another tool to get the page with the data loaded via JavaScript, then parse it into Scrapy items.
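For the browser-based route, a minimal Python 3 sketch might look like the following. It assumes Selenium and Firefox are installed, and it treats the presence of the class _4_yl in the page source as the sign that the results have loaded, which is a guess based on the XPath in the question. The function names are hypothetical; driver.get() and driver.page_source are the standard Selenium WebDriver calls.

```python
import time


def make_firefox_driver():
    """Start a Firefox session (requires Selenium and Firefox to be installed)."""
    from selenium import webdriver  # imported lazily so the helper below has no hard dependency
    return webdriver.Firefox()


def wait_for_rendered_html(driver, url, marker, timeout=10.0, poll=0.5):
    """Load url in the browser and return the HTML once marker appears in it.

    Raises TimeoutError if the marker has not shown up within timeout seconds.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # HTML of the DOM as the browser currently sees it
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("marker %r not found within %.1fs" % (marker, timeout))
        time.sleep(poll)


if __name__ == "__main__":
    driver = make_firefox_driver()
    try:
        # The login step is omitted here; the page in question requires it.
        html = wait_for_rendered_html(
            driver,
            "http://www.facebook.com/search/104958162837/employees/present",
            'class="_4_yl"',
        )
        print(len(html))
    finally:
        driver.quit()
```

The returned HTML can then be handed to Scrapy's selectors, for example Selector(text=html).xpath('//div[@class="_4_yl"]') in current Scrapy versions, and turned into items as usual.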

Hope that helps.

alecxe answered Sep 22 '22 22:09