When I try to scrape this site with PhantomJS, it sends the following header to the server by default:
{"name":"User-Agent",
"value":"Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.1 Safari/534.34"}
And I get a status 405 "Not Allowed" response.
I read in the PhantomJS API Reference that, in order to imitate a request coming from another browser, I should change my User-Agent value. On Wikipedia I found the value I should use to pretend to be Firefox on Ubuntu:
'name': 'User-Agent',
'value': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20120815 Firefox/16.0'
Where in my PhantomJS script should I put these properties? Should I set them inside page.open, inside page.evaluate, or at the top of the script?
Actually, it goes on page.settings. Set it before calling page.open.
Here is an example that uses it against the page you linked:
var page = require('webpage').create();
// Set the User-Agent before calling page.open.
page.settings.userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36';
page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function(status) {
    if (status !== 'success') {
        console.log('Failed to load the page');
        phantom.exit(1);
        return;
    }
    // Give the page's own scripts a moment to build the table.
    window.setTimeout(function() {
        var output = page.evaluate(function() {
            return document.getElementById('tournamentTable')
                .getElementsByClassName('deactivate')[0]
                .getElementsByTagName('a')[0]
                .textContent;
        });
        console.log(output);
        phantom.exit();
    }, 1000);
});
This example scrapes the match name from the first row of the table (which, at the time of writing, is "San Francisco Giants - Boston Red Sox").
As for your comment: you can in fact use jQuery under PhantomJS. See this example:
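If you want to confirm which User-Agent is actually being sent, and which status code the server answers with, a sketch along these lines may help. It relies on the page.onResourceRequested and page.onResourceReceived callbacks, whose request headers come as the same name/value pairs shown in the question. findHeader is a helper name I made up for illustration, and the typeof phantom guard is only an assumption of mine so that the helper can also be loaded outside PhantomJS.

```javascript
// Hypothetical helper: return the value of the named header from an array of
// {name, value} objects (the shape PhantomJS uses for request headers),
// or null if the header is not present. The name match is case-insensitive.
function findHeader(headers, name) {
    var wanted = name.toLowerCase();
    for (var i = 0; i < headers.length; i++) {
        if (headers[i].name.toLowerCase() === wanted) {
            return headers[i].value;
        }
    }
    return null;
}

// Only run the scraping part when the script is executed by PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.settings.userAgent = 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20120815 Firefox/16.0';

    // Log the User-Agent of every outgoing request (this can be noisy).
    page.onResourceRequested = function (requestData) {
        console.log('UA sent: ' + findHeader(requestData.headers, 'User-Agent'));
    };

    // Log the HTTP status of every completed response.
    page.onResourceReceived = function (response) {
        if (response.stage === 'end') {
            console.log(response.status + ' ' + response.url);
        }
    };

    page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function (status) {
        console.log('Load status: ' + status);
        phantom.exit();
    });
}
```

If the override took effect, the logged User-Agent should be the Firefox string rather than the default PhantomJS one.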
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36';
page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function(status) {
    if (status !== 'success') {
        console.log('Failed to load the page');
        phantom.exit(1);
        return;
    }
    window.setTimeout(function() {
        // Inject jQuery into the page, then use it inside page.evaluate.
        page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js", function() {
            var output = page.evaluate(function () {
                return jQuery('#tournamentTable .deactivate:first a:first').text();
            });
            console.log(output);
            phantom.exit();
        });
    }, 1000);
});
By the way, for waiting, I recommend using waitfor.js rather than the window.setTimeout I used in these examples.
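To give an idea of what polling looks like compared with a fixed delay, here is a minimal sketch. It is not the waitfor.js script itself: the waitFor function below, its parameter names, and the 5000 ms timeout are my own assumptions for illustration. It checks a condition repeatedly and continues as soon as the table is present, instead of always sleeping for one second.

```javascript
// Poll testFn every intervalMs (default 100) until it returns a truthy value,
// then call onReady. If timeoutMs elapses first, call onTimeout instead.
function waitFor(testFn, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (testFn()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs || 100);
}

// Only run the scraping part when the script is executed by PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.settings.userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36';
    page.open('http://www.oddsportal.com/baseball/usa/mlb/results/page/', function (status) {
        if (status !== 'success') {
            console.log('Failed to load the page');
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Condition: the results table exists and has at least one row.
            return page.evaluate(function () {
                var table = document.getElementById('tournamentTable');
                return !!table && table.getElementsByClassName('deactivate').length > 0;
            });
        }, function () {
            var output = page.evaluate(function () {
                return document.getElementById('tournamentTable')
                    .getElementsByClassName('deactivate')[0]
                    .getElementsByTagName('a')[0]
                    .textContent;
            });
            console.log(output);
            phantom.exit();
        }, function () {
            console.log('Timed out waiting for the table');
            phantom.exit(1);
        }, 5000);
    });
}
```

The advantage over a fixed window.setTimeout is that the script neither reads the table too early on a slow load nor waits longer than needed on a fast one.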