Pretend to be Firefox instead of PhantomJS

When I try to scrape this site with PhantomJS, it sends the following header to the server by default:

{"name": "User-Agent",
 "value": "Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.1 Safari/534.34"}

And I get a 405 "Not Allowed" response.

I read in the PhantomJS API Reference that, to imitate a request coming from another browser, I should change the User-Agent value. On Wikipedia I found the value to use to pretend to be Firefox on Ubuntu:

'name': 'User-Agent',
'value': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20120815 Firefox/16.0'

Where in PhantomJS should I put these properties? Should I insert them inside page.open, inside page.evaluate, or at the top of the script?

khex asked Aug 19 '13 22:08


1 Answer

Actually, it goes on page.settings. Set it before calling page.open.

Here is an example using it against that page you linked:

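var page = require('webpage').create();
// Set the user agent before page.open so the first request already carries it.
page.settings.userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36';
page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function(status) {
    if (status !== 'success') {
        console.log('Failed to load the page');
        phantom.exit(1);
        return;
    }
    // Give the page's own scripts a second to fill in the table.
    window.setTimeout(function() {
        var output = page.evaluate(function() {
            return document.getElementById('tournamentTable')
                .getElementsByClassName('deactivate')[0]
                .getElementsByTagName('a')[0]
                .textContent;
        });
        console.log(output);
        // Without this call the PhantomJS process never terminates.
        phantom.exit();
    }, 1000);
});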

This example scrapes the match name from the first row of the table (which, at this precise moment, is "San Francisco Giants - Boston Red Sox").
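To make the ordering explicit, here is a minimal sketch of a helper that sets the user agent and only then opens the page. The name openWithUserAgent and its error-first callback are hypothetical, not part of PhantomJS. It relies only on page.settings.userAgent and page.open, and assumes the open callback receives the status string 'success' or 'fail'.

```javascript
// Hypothetical helper (not part of PhantomJS): set the user agent first,
// then open the URL, so the very first request already carries the new header.
function openWithUserAgent(page, url, userAgent, callback) {
    page.settings.userAgent = userAgent;   // must happen before page.open
    page.open(url, function (status) {
        if (status !== 'success') {
            callback(new Error('Could not load ' + url + ' (status: ' + status + ')'));
            return;
        }
        callback(null, page);
    });
}

// Usage inside a PhantomJS script (left commented out so the helper itself
// does not depend on the PhantomJS runtime):
// var page = require('webpage').create();
// openWithUserAgent(page, 'http://www.oddsportal.com/baseball/usa/mlb/results/page/',
//     'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20120815 Firefox/16.0',
//     function (err, page) {
//         console.log(err ? err.message : 'Loaded');
//         phantom.exit(err ? 1 : 0);
//     });
```

Setting the value inside page.evaluate would not help, because that code runs in the page after the request has already been sent.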


About your comment: you actually can use jQuery under PhantomJS! Check this example:

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36';
page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function() {
    window.setTimeout(function() {
        // Inject jQuery into the page; the callback runs once it has loaded.
        page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js", function() {
            var output = page.evaluate(function () {
                return jQuery('#tournamentTable .deactivate:first a:first').text();
            });
            console.log(output);
            // Without this call the PhantomJS process never terminates.
            phantom.exit();
        });
    }, 1000);
});

By the way, for waiting, I recommend using waitfor.js instead of the window.setTimeout I used in these examples.
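The idea behind that recommendation is to poll a condition until it becomes true or a timeout expires, rather than sleeping for a fixed second. The sketch below only illustrates that idea. It is not the waitfor.js file itself, and the parameter names (condition, onReady, onTimeout, timeoutMs, intervalMs) are my own.

```javascript
// Illustrative polling helper, not the actual waitfor.js source.
// Calls condition() every intervalMs milliseconds; runs onReady once it
// returns true, or onTimeout once timeoutMs milliseconds have elapsed.
function waitFor(condition, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (condition()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// Possible use in place of window.setTimeout (commented out because it
// needs the PhantomJS runtime and an already opened page):
// waitFor(function () {
//     return page.evaluate(function () {
//         return document.querySelectorAll('#tournamentTable .deactivate').length > 0;
//     });
// }, function () {
//     /* scrape here */
//     phantom.exit();
// }, function () {
//     console.log('Timed out waiting for the table');
//     phantom.exit(1);
// }, 5000, 250);
```

Polling this way finishes as soon as the table rows exist and fails loudly if they never appear, whereas a fixed delay can be both too long and too short.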

chris-l answered Sep 20 '22 17:09