
I need to scrape data from a Facebook game - using Ruby

Revised (clarified question)

I've spent a few days already trying to figure out how to scrape specific information from a Facebook game, but I've run into brick wall after brick wall. As best I can tell, the main problem is as follows. I can use Chrome's inspect element tool to manually find the HTML that I need - it appears nestled inside an iframe. However, when I try to scrape that iframe, it is empty (except for its attributes):

<iframe id="game_frame" name="game_frame" src="" scrolling="no" ...></iframe>

This is the same output that I see if I use a browser's "View page source" tool. I don't understand why I can't see the data in the iframe. The answer is NOT that it's being added afterwards by AJAX. (I know that both because "View page source" can show data that's been added by Ajax, and because I've waited until after I can see the data on the page before scraping it, and it's still not there.)

Is this happening because of Facebook's anti-scraping measures, and if so, is there a way around it? Or am I just missing something? I program in Ruby and I've tried Nokogiri, then Mechanize, then Capybara, without success.

I don't know if it makes any difference, but it seems to me that the iframe gets its data via its "game_frame" name, which is the target of this form that appears earlier in the document:

<form id="hidden_login_form_1331840407" action="" method="POST" target="game_frame">
  <input type="hidden" name="signed_request" autocomplete="off" value="v6kIAsKTZa...">
  ...
</form>
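
In case it helps, here's a rough sketch of what I think that hidden form submission amounts to if replayed by hand (the canvas URL is a placeholder I made up, and the signed_request value is the truncated one from the form above - I haven't gotten this to work):

require 'net/http'
require 'uri'

# Rough sketch: replay the hidden form's POST by hand.
# The canvas URL below is a made-up placeholder, and the signed_request
# value is the truncated one copied from the hidden form.
canvas_url     = URI.parse('https://apps.example-game.com/canvas/')
signed_request = 'v6kIAsKTZa...'

http = Net::HTTP.new(canvas_url.host, canvas_url.port)
http.use_ssl = (canvas_url.scheme == 'https')

request = Net::HTTP::Post.new(canvas_url.request_uri)
request.set_form_data('signed_request' => signed_request)

response = http.request(request)
puts response.body   # in theory, this is the document that ends up inside the iframe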

Original question

I wrote a Ruby program that uses Nokogiri to scrape data from a Facebook game's HTML. Currently, I get the HTML by using Chrome's "inspect element" tool, save it to a file, and parse it from there. However, I would really like to be able to access the information from within Ruby. For example, I would pass the program the page URL "www.gamename.com/...?id=12345" and it would log in to Facebook, go to that page, and scrape the data. Currently, if I try that, it doesn't work because I get redirected to Facebook's login page. How can I get past the login screen to access the page(s) I need?

I would like to do this using the Nokogiri code that I have already written; however, if I have to, I could rewrite it using something else. Currently, the program is a standalone program - not a Rails app - but I could change that. I've seen some information that might point me in the direction of OmniAuth, but I'm not sure that's what I'm looking for, and it also looks very complicated. I'm hoping there's a simpler solution.

Thanks

asked Mar 14 '12 by Mike Schachter

2 Answers

I can recommend capybara-webkit for this kind of task. It uses QtWebKit under the hood and understands JavaScript:

require 'capybara-webkit'
require 'capybara/dsl'
require 'nokogiri'

include Capybara::DSL
Capybara.current_driver = :webkit

# log in to Facebook
visit("https://www.facebook.com")
find("#email").set("user")
find("#pass").set("password")
find("#loginbutton input").click

# navigate to the JS-generated page
visit("http://www.gamename.com/...?id=12345")

# parse the rendered HTML with Nokogiri
doc = Nokogiri::HTML.parse(body)
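
Since the markup you want lives inside the game_frame iframe, you may also need to switch into that frame before reading anything from it. A rough, untested sketch (the "#score" selector is a made-up placeholder, and the exact locator within_frame accepts depends on your Capybara version):

# switch into the game's iframe by its name before querying elements
within_frame("game_frame") do
  score = find("#score").text   # "#score" is a placeholder selector
  puts score
end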
answered by Niklas B.

The easiest way is to use Mechanize:

require 'mechanize'

@agent = Mechanize.new { |a| a.user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)' }
page = @agent.get 'http://www.facebook.com/'
form = page.forms[0]   # the login form
form['email'], form['pass'] = 'you@example.com', 'foobar'   # your Facebook credentials
form.submit
# now you're logged in, and a request like this:
doc = @agent.get('http://www.facebook.com/').parser
# gives you a logged-in Nokogiri::HTML::Document like you're used to
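
If you then need the content of the game_frame iframe itself, one option (untested, and assuming the hidden form from the question appears on the page you fetch) is to submit that form with Mechanize and parse the response:

# fetch the game page (placeholder URL from the question)
game_page = @agent.get('http://www.gamename.com/...?id=12345')
# find the hidden form that targets the game_frame iframe; its id carries a
# random suffix, so match on the prefix
hidden_form = game_page.forms.find { |f| f.form_node['id'].to_s.start_with?('hidden_login_form') }
# submitting it should return the document that would have loaded in the iframe
frame_page = @agent.submit(hidden_form)
doc = frame_page.parser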
answered by pguardiario