Problems with web site scraping using zombie.js

Tags: javascript, node.js, facebook, screen-scraping, zombie.js

I need to do some web scraping. After playing around with different web testing frameworks, most of which were either too slow (Selenium) or too buggy for my needs (env.js), I decided that zombie.js looks most promising, as it uses a solid set of libraries for HTML parsing and DOM manipulation. However, it seems to me that it doesn't even support basic event-based JavaScript code like that in the following web page:

<html>
  <head>
    <title>test</title>
    <script type="text/javascript">

      console.log("test script executing...");
      console.log("registering callback for event DOMContentLoaded on " + document);

      document.addEventListener('DOMContentLoaded', function(){
        console.log("DOMContentLoaded triggered");
      }, false);

      function loaded() {
        console.log("onload triggered");
      }

    </script>
  </head>

  <body onload="loaded();">
    <h1>Test</h1>
  </body>
</html>

I then decided to trigger those events manually like this:

var zombie = require("zombie");

zombie.visit("http://localhost:4567/", { debug: true }, function (err, browser, status) {

  // Fire DOMContentLoaded on the document by hand.
  var doc = browser.document;
  console.log("firing DOMContentLoaded on " + doc);
  browser.fire("DOMContentLoaded", doc, function (err, browser, status) {

    // Then fire load on the body so that its onload handler runs.
    var body = browser.querySelector("body");
    console.log("firing load on " + body);
    browser.fire("load", body, function (err, browser, status) {

      // Dump the resulting markup.
      console.log(browser.html());

    });
  });

});

This works for this particular test page. My problem is a more general one, though: I want to be able to scrape more complex, AJAX-based sites like a friends list on Facebook (something like http://www.facebook.com/profile.php?id=100000028174850&sk=friends&v=friends). Logging into the site using zombie is no problem, but some content, like those lists, seems to be loaded entirely dynamically using AJAX, and I don't know how to trigger the event handlers that initiate the loading.

There are several questions I have regarding this problem:

Has somebody already implemented a similarly complex scraper without using a browser remote-controlling solution like Selenium?
Is there some reference on the loading process of a complex JavaScript-based page?
Can somebody provide advice on how to debug a real browser to see what I might need to execute to trigger the Facebook event handlers?
Any other ideas about this topic?

Again, please do not point me to solutions that involve controlling a real browser, like Selenium, as I know about those. What is welcome, however, are suggestions for a real in-memory renderer like WebKit that is accessible from the Ruby scripting language, preferably with the possibility to set cookies and preferably also to load raw HTML instead of triggering real HTTP requests.

422

asked Sep 07 '11 15:09

Niklas B.

1 Answer

For purposes of data extraction, running a "headless browser" and triggering JavaScript events manually is not going to be the easiest thing to do. It is not impossible, but there are simpler ways to do it.

Most sites, even AJAX-heavy ones, can be scraped without executing a single line of their JavaScript code. In fact, it's usually easier than trying to figure out a site's JavaScript code, which is often obfuscated, minified, and difficult to debug. If you have a solid understanding of HTTP you will understand why: (almost) all interactions with the server are encoded as HTTP requests, so whether they are initiated by JavaScript, by the user clicking a link, or by custom code in a bot program, there's no difference to the server. (I say almost because when Flash or applets get involved there's no telling what data is flying where; they can be application-specific. But anything done in JavaScript will go over HTTP.)

That being said, it is possible to mimic a user on any website using custom software. First you have to be able to see the raw HTTP requests being sent to the server. You can use a proxy server to record requests made by a real browser to the target website. There are many, many tools you can use for this: Charles and Fiddler are handy, most dedicated screen-scraper tools have a basic proxy built in, and the Firebug extension for Firefox and the developer tools in Chrome offer similar views of AJAX requests... you get the idea.
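To make the recording step concrete, here is a minimal sketch of what such a proxy does, written with Node's built-in http module. It is only an illustration under stated assumptions: it handles plain HTTP (no HTTPS/CONNECT tunnelling), and the names describeRequest and startLoggingProxy are made up for this example. A real tool like Charles or Fiddler does far more.

```javascript
// Sketch of a logging forward proxy for plain HTTP only (no HTTPS/CONNECT).
// describeRequest and startLoggingProxy are hypothetical names.
const http = require("http");

// Build one log line per request: method, full URL and any cookies sent.
function describeRequest(method, url, headers) {
  const cookie = headers["cookie"] ? " cookie=" + headers["cookie"] : "";
  return method + " " + url + cookie;
}

function startLoggingProxy(port) {
  const server = http.createServer(function (clientReq, clientRes) {
    console.log(describeRequest(clientReq.method, clientReq.url, clientReq.headers));

    // A browser configured to use a forward proxy sends the absolute URL.
    let target;
    try {
      target = new URL(clientReq.url);
    } catch (e) {
      clientRes.writeHead(400);
      clientRes.end("expected an absolute URL");
      return;
    }

    // Forward the request unchanged and stream the response back.
    const upstream = http.request({
      hostname: target.hostname,
      port: target.port || 80,
      path: target.pathname + target.search,
      method: clientReq.method,
      headers: clientReq.headers,
    }, function (upstreamRes) {
      clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
      upstreamRes.pipe(clientRes);
    });
    upstream.on("error", function (err) {
      clientRes.writeHead(502);
      clientRes.end("proxy error: " + err.message);
    });
    clientReq.pipe(upstream);
  });
  return server.listen(port);
}
```

Calling startLoggingProxy(8888) (the port is an arbitrary choice) and pointing the browser's HTTP proxy setting at localhost:8888 would print one line per request the page makes, including the AJAX calls.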

Once you can see the HTTP requests that are made as a result of a particular action on the website, it is easy to write a program to mimic these requests; just send the same requests to the server and it will treat your program just like a browser in which a particular action has been performed.
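As a rough sketch of that replay step, the following uses only Node's built-in http and https modules. Everything specific in it is an assumption for illustration: the recorded object, the /friends.json path, the cookie value, and the helper names toRequestOptions and replay are hypothetical. Only the host localhost:4567 is borrowed from the question's test setup.

```javascript
// Sketch: replay a request captured with a proxy (Charles, Fiddler, ...).
// The recorded values and helper names below are hypothetical placeholders.
const http = require("http");
const https = require("https");

// A request as you might copy it out of the proxy's log.
const recorded = {
  method: "GET",
  url: "http://localhost:4567/friends.json?page=2",
  headers: {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest", // some servers check this on AJAX calls
  },
  cookies: { session: "abc123" }, // copied from the logged-in browser session
};

// Turn the recorded request into options for http.request / https.request.
function toRequestOptions(rec) {
  const u = new URL(rec.url);
  const cookieHeader = Object.entries(rec.cookies || {})
    .map(function (pair) { return pair[0] + "=" + pair[1]; })
    .join("; ");
  const headers = Object.assign({}, rec.headers);
  if (cookieHeader) headers["Cookie"] = cookieHeader;
  return {
    protocol: u.protocol,
    hostname: u.hostname,
    port: u.port || (u.protocol === "https:" ? 443 : 80),
    path: u.pathname + u.search,
    method: rec.method,
    headers: headers,
  };
}

// Send the request and pass the status code and raw body to the callback.
function replay(rec, callback) {
  const options = toRequestOptions(rec);
  const client = options.protocol === "https:" ? https : http;
  const req = client.request(options, function (res) {
    let body = "";
    res.setEncoding("utf8");
    res.on("data", function (chunk) { body += chunk; });
    res.on("end", function () { callback(null, res.statusCode, body); });
  });
  req.on("error", callback);
  if (rec.body) req.write(rec.body);
  req.end();
}
```

With a server listening at that address, replay(recorded, function (err, status, body) { console.log(status, body); }) would print whatever the endpoint returns. The body is usually JSON or an HTML fragment that can be parsed directly, with no page JavaScript involved.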

Different languages have different libraries for this, with varying capabilities. For Ruby, I have seen a lot of people using Mechanize.

If data extraction is your only goal, then you'll almost always be able to get what you need by mimicking HTTP requests this way. No Javascript required.

Note - Since you mentioned Facebook, I should mention that scraping Facebook specifically can be exceptionally difficult (although not impossible), because Facebook has measures in place to detect automated access (they use more than just captchas); they will disable an account if they see suspicious activity coming from it. It is, after all, against their terms of service (section 3.2).

137

answered Sep 29 '22 17:09

jches
