
Problems with web site scraping using zombie.js

I need to do some web scraping. After playing around with different web testing frameworks, most of which were either too slow (Selenium) or too buggy for my needs (env.js), I decided that zombie.js looks the most promising, as it uses a solid set of libraries for HTML parsing and DOM manipulation. However, it seems that it doesn't even support basic event-based Javascript code like that in the following web page:

<html>
  <head>
    <title>test</title>
    <script type="text/javascript">

      console.log("test script executing...");
      console.log("registering callback for event DOMContentLoaded on " + document);

      document.addEventListener('DOMContentLoaded', function(){
        console.log("DOMContentLoaded triggered");
      }, false);

      function loaded() {
        console.log("onload triggered");
      }

    </script>
  </head>

  <body onload="loaded();">
    <h1>Test</h1>
  </body>
</html>

I then decided to trigger those events manually like this:

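var zombie = require("zombie");

// Load the test page, then fire the events by hand,
// since zombie does not seem to trigger them on its own.
zombie.visit("http://localhost:4567/", { debug: true }, function (err, browser, status) {

  // Step 1: fire DOMContentLoaded on the document.
  var doc = browser.document;
  console.log("firing DOMContentLoaded on " + doc);
  browser.fire("DOMContentLoaded", doc, function (err, browser, status) {

    // Step 2: fire load on <body> so its onload handler runs.
    var body = browser.querySelector("body");
    console.log("firing load on " + body);
    browser.fire("load", body, function (err, browser, status) {

      // Step 3: dump the resulting markup.
      console.log(browser.html());

    });
  });

});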

This works for this particular test page. My problem is a more general one, though: I want to be able to scrape more complex, AJAX-based sites like a friends list on Facebook (something like http://www.facebook.com/profile.php?id=100000028174850&sk=friends&v=friends). It is no problem to log into the site using zombie, but some content, like those lists, seems to be loaded entirely dynamically using AJAX, and I don't know how to trigger the event handlers that initiate the loading.

There are several questions I have regarding this problem:

  • Has somebody already implemented a similarly complex scraper without using a browser remote-controlling solution like Selenium?
  • Is there some reference on the loading process of a complex Javascript-based page?
  • Can somebody provide advice on how to debug a real browser to see what I might need to execute to trigger the Facebook event handlers?
  • Any other ideas about this topic?

Again, please do not point me to solutions that involve controlling a real browser, like Selenium, as I know about those. What is welcome, however, are suggestions for a real in-memory renderer like WebKit that is accessible from the Ruby scripting language, preferably with the ability to set cookies and preferably also to load raw HTML instead of triggering real HTTP requests.

Niklas B. asked Sep 07 '11 15:09



1 Answer

For purposes of data extraction, running a "headless browser" and triggering javascript events manually is not going to be the easiest thing to do. While not impossible, there are simpler ways to do it.

Most sites, even AJAX-heavy ones, can be scraped without executing a single line of their Javascript code. In fact it's usually easier than trying to figure out a site's Javascript code, which is often obfuscated, minified, and difficult to debug. If you have a solid understanding of HTTP you will understand why: (almost) all interactions with the server are encoded as HTTP requests, so whether they are initiated by Javascript, or the user clicking a link, or custom code in a bot program, there's no difference to the server. (I say almost because when Flash or applets get involved there's no telling what data is flying where; they can be application-specific. But anything done in Javascript will go over HTTP.)

That being said, it is possible to mimic a user on any website using custom software. First you have to be able to see the raw HTTP requests being sent to the server. You can use a proxy server to record the requests made by a real browser to the target website. There are many, many tools you can use for this: Charles and Fiddler are handy, most dedicated screen-scraper tools have a basic proxy built in, and the Firebug extension for Firefox and the tools in Chrome offer similar views of AJAX requests... you get the idea.
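Once a session has been recorded, the next step is to pick the AJAX calls out of it. The sketch below is one way to do that in Node.js. It assumes the recording was exported in the HAR 1.2 format (`log.entries[].request` and `.response`), and the `X-Requested-With` / JSON checks are only a heuristic, since not every site sets that header. The function names and the sample data are made up for illustration.

```javascript
// Sketch: list the AJAX-style requests found in a recorded session.
// Assumes a HAR 1.2 export: har.log.entries[i].request / .response.

// Case-insensitive lookup in a HAR header list ([{ name, value }, ...]).
function headerValue(headers, name) {
  const wanted = name.toLowerCase();
  const hit = (headers || []).find(h => h.name.toLowerCase() === wanted);
  return hit ? hit.value : undefined;
}

// Heuristic: treat an entry as AJAX if it carries the usual XHR marker
// header or if the response body is JSON.
function isAjaxEntry(entry) {
  const marker = headerValue(entry.request.headers, 'X-Requested-With');
  const content = (entry.response && entry.response.content) || {};
  return marker === 'XMLHttpRequest' || /json/i.test(content.mimeType || '');
}

// Reduce each matching entry to the parts needed to send it again.
function listAjaxRequests(har) {
  return har.log.entries.filter(isAjaxEntry).map(entry => ({
    method: entry.request.method,
    url: entry.request.url,
    headers: entry.request.headers,
    body: entry.request.postData ? entry.request.postData.text : undefined,
  }));
}

// Hypothetical two-entry recording: one page load, one AJAX call.
const sampleHar = {
  log: {
    entries: [
      {
        request: { method: 'GET', url: 'http://localhost:4567/', headers: [] },
        response: { content: { mimeType: 'text/html' } },
      },
      {
        request: {
          method: 'GET',
          url: 'http://localhost:4567/friends?page=2',
          headers: [{ name: 'X-Requested-With', value: 'XMLHttpRequest' }],
        },
        response: { content: { mimeType: 'application/json' } },
      },
    ],
  },
};

const ajaxRequests = listAjaxRequests(sampleHar);
console.log(ajaxRequests.map(r => r.method + ' ' + r.url));
```

With a real recording you would read the exported file with `JSON.parse(fs.readFileSync(path, 'utf8'))` instead of using the inline sample.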

Once you can see the HTTP requests that are made as a result of a particular action on the website, it is easy to write a program to mimic these requests; just send the same requests to the server and it will treat your program just like a browser in which a particular action has been performed.
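As a rough illustration of sending the same request again, here is a minimal Node.js sketch that uses only the built-in `http`/`https` modules. The shape of the `recorded` object (`method`, `url`, a `headers` list of `{ name, value }` pairs, and an optional `body`) is an assumption made for this example, as are the function names; adapt them to whatever your recording tool produces.

```javascript
const http = require('http');
const https = require('https');
const { URL } = require('url');

// Turn a recorded request into options for http.request / https.request.
// `recorded` is a hypothetical shape: { method, url, headers: [{name, value}], body }.
function toRequestOptions(recorded) {
  const url = new URL(recorded.url);
  const headers = {};
  for (const h of recorded.headers || []) {
    const name = h.name.toLowerCase();
    // Skip headers that Node derives itself, and HTTP/2 pseudo-headers.
    if (name === 'host' || name === 'content-length' || name.startsWith(':')) continue;
    headers[h.name] = h.value;
  }
  if (recorded.body) {
    headers['Content-Length'] = Buffer.byteLength(recorded.body);
  }
  return {
    protocol: url.protocol,
    hostname: url.hostname,
    port: url.port ? Number(url.port) : (url.protocol === 'https:' ? 443 : 80),
    path: url.pathname + url.search,
    method: recorded.method,
    headers: headers,
  };
}

// Send the recorded request and hand the response body to the callback.
function replay(recorded, callback) {
  const options = toRequestOptions(recorded);
  const transport = options.protocol === 'https:' ? https : http;
  const req = transport.request(options, res => {
    const chunks = [];
    res.on('data', chunk => chunks.push(chunk));
    res.on('end', () => {
      callback(null, res.statusCode, Buffer.concat(chunks).toString('utf8'));
    });
  });
  req.on('error', callback);
  if (recorded.body) req.write(recorded.body);
  req.end();
}

// Usage (performs a real request, so it is left commented out):
// replay({
//   method: 'GET',
//   url: 'http://localhost:4567/',
//   headers: [{ name: 'Cookie', value: 'session=abc' }],
// }, (err, status, body) => {
//   if (err) throw err;
//   console.log(status, body.length);
// });
```

Copying the recorded `Cookie` header is what carries a logged-in session over to the script; a longer-running scraper would also need to track `Set-Cookie` responses, which this sketch does not do.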

Different languages have libraries for this, with differing capabilities. For Ruby, I have seen a lot of people using Mechanize.

If data extraction is your only goal, then you'll almost always be able to get what you need by mimicking HTTP requests this way. No Javascript required.

Note - Since you mentioned Facebook, I should mention that scraping Facebook specifically can be exceptionally difficult (although not impossible), because Facebook has measures in place to detect automated access (they use more than just captchas); they will disable an account if they see suspicious activity coming from it. It is, after all, against their terms of service (section 3.2).

jches answered Sep 29 '22 17:09