
Site scraping: Waiting until site is completely loaded

I need to download the following webpage: http://m.10bet.com/#leage_panel#10096

It is a sports betting page and I need the quotes. At first glance this seems pretty simple. However, here is what happens (you can check this with e.g. the developer tools of your browser):

  1. Open the URL
  2. The page loads an initial HTML document that subsequently issues an AJAX request to retrieve the quotes
  3. The quotes are contained in the JSON response, but they are obfuscated, so it is not possible to simply parse them directly from the AJAX call. The JavaScript of the webpage is obfuscated as well, so there is no chance of reading the quotes directly from the request.

Instead, I will need to use a headless browser capable of evaluating JavaScript. HtmlUnit for Java is inadequate since it does not offer robust JavaScript support. Therefore PhantomJS in combination with CasperJS is my current choice. I run CasperJS with the following script:

var casper = require('casper').create();

casper.start('http://m.10bet.com/#leage_panel#10096', function() {
    var url = 'http://m.10bet.com/#leage_panel#10096';
    this.download(url, '10bet.html');
});

casper.run(function() {
    this.echo('Done.').exit();
});

However, this script does not load the complete page, just the initial page. How do I load the complete webpage as it is presented in the browser?

asked Nov 26 '13 19:11 by toom


1 Answer

That script looks like a good start, but as soon as your (HTML) page loads, the (CasperJS) script stops, because you have not given it any more instructions. The crudest way to fix this would be to go to sleep for a couple of seconds, then scrape the page:

var casper = require('casper').create();
var fs = require('fs');

casper.start('http://m.10bet.com/#leage_panel#10096', function() {
    // Give the page's AJAX request two seconds to finish, then save the rendered HTML.
    this.wait(2000, function() {
        fs.write('10bet.html', this.getHTML(), 'w');
    });
});

casper.run();

A fixed 2000ms wait is crude for a couple of reasons:

  1. If the data loads quicker than that, you are wasting time.
  2. If it loads slower, your script does not work.

So it is better to identify something on the page that you want and need to be there, and then use one of Casper's waitForXXX() functions. See the API docs starting here: http://casperjs.readthedocs.org/en/latest/modules/casper.html#waitfor
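As a rough sketch of that idea, the script below uses waitForSelector() in place of the fixed wait. It assumes that '#league_block' is the element holding the quotes (check the real page for a suitable selector), and the helper name saveWhenReady and the 10-second timeout are just illustrative choices:

```javascript
// Sketch: wait for an element to appear instead of sleeping for a fixed time.
// ASSUMPTION: '#league_block' is the element that contains the quotes.
// saveWhenReady is a hypothetical helper name, not part of CasperJS.

// Queues a step that waits for `selector`, then writes the rendered page to `outFile`.
function saveWhenReady(casper, fs, selector, outFile) {
    casper.waitForSelector(selector, function onFound() {
        // Runs as soon as the selector matches something in the page.
        fs.write(outFile, casper.getHTML(), 'w');
    }, function onTimeout() {
        // Runs if nothing matched within the timeout below.
        casper.echo('Timed out waiting for ' + selector);
    }, 10000); // give up after 10 seconds (arbitrary value)
}

// Only start scraping when the script is run by CasperJS/PhantomJS.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create();
    casper.start('http://m.10bet.com/#leage_panel#10096');
    saveWhenReady(casper, require('fs'), '#league_block', '10bet.html');
    casper.run();
}
```

With this approach the script continues as soon as the element exists, and a slow response only fails once the (adjustable) timeout has passed.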

As another point, I'm guessing you don't actually want the whole HTML page, just the data in it. getHTML() takes a selector parameter that limits what is returned. E.g. in your case getHTML('#league_block') might be much more useful. Again, see the API docs for more ideas.
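A minimal sketch of grabbing only that fragment might look like this. It again assumes '#league_block' is the right selector, and grabFragment and the output file name are made-up names for illustration:

```javascript
// Sketch: save one fragment of the page rather than the whole document.
// ASSUMPTION: '#league_block' wraps the quotes on the real page.
// grabFragment is a hypothetical helper name, not part of CasperJS.

// Returns the markup of the first element matching `selector`.
// Passing `true` as the second argument asks getHTML() for the outer HTML,
// i.e. including the matched element's own tag.
function grabFragment(casper, selector) {
    return casper.getHTML(selector, true);
}

// Only start scraping when the script is run by CasperJS/PhantomJS.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create();
    var fs = require('fs');
    casper.start('http://m.10bet.com/#leage_panel#10096');
    // Wait until the element exists; getHTML() fails if the selector matches nothing.
    casper.waitForSelector('#league_block', function () {
        fs.write('league_block.html', grabFragment(this, '#league_block'), 'w');
    });
    casper.run();
}
```

The saved file is then much smaller, and any later parsing step only has to deal with the markup around the quotes.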

answered Nov 04 '22 00:11 by Darren Cook