I'm having a problem with web scraping in Node.js. I want to take some data from a remote web page, but the data is inserted into the HTML by JavaScript. I started using PhantomJS and it works great, except for one thing that is preventing me from finishing the job: PhantomJS is too slow. This snippet of code needs about 14 seconds to execute:
var page = require('webpage').create();
page.open('https://www.halooglasi.com/nekretnine/izdavanje-stanova/novi-beograd---novi-merkator-id19270/5425485514649', function () {
    phantom.exit();
});
With the request library, which just returns the raw data, it is much faster: a little more than a second, so PhantomJS spends another 13 seconds or so. It looks like PhantomJS is doing a lot of unnecessary work that I don't need. I don't need to render pictures, videos, or anything else; I just need the JavaScript to execute so I can use cheerio to get the data from the HTML. Can you tell me how to speed up PhantomJS, or suggest some other, faster WebKit-based tool for my needs?
There are several measures you can take to decrease processing time.
1. Get a more powerful server/computer (as Mathieu rightly noted).
Yes, you could argue this is irrelevant to the question, but in matters of scraping it very much is. On a budget $8 VPS, without any optimization, your initial script ran for 9589ms, which is already a ~30% improvement.
2. Turn off image loading. It will help... a bit: 8160ms load time.
page.settings.loadImages = false;
3. Analyze the page, then find and cancel unnecessary network requests.
Even in a normal browser like Google Chrome the site loads slowly: 129 requests/8.79s load time with AdblockPlus. There are a lot of requests (gif, 1 MB), and many of them go to third-party sites like Facebook and Twitter (to fetch widgets) and to ad sites.
We can cancel them too:
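// Third-party hosts whose requests are not needed for the page content.
var block_urls = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com', 'facebook.net', 'facebook.com', 'planplus.rs'];
page.onResourceRequested = function (requestData, request) {
    // Indexed loop with a declared counter, so no global variable is leaked.
    for (var i = 0; i < block_urls.length; i++) {
        if (requestData.url.indexOf(block_urls[i]) !== -1) {
            request.abort();
            console.log(requestData.url + " aborted");
            return;
        }
    }
};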
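One caveat with the substring check above: indexOf matches anywhere in the URL, so a first-party request whose query string merely contains "facebook.com" would be aborted too. Below is a minimal sketch of a stricter, host-based check. It is written in plain ES5 so it should also run under PhantomJS; hostOf and isBlocked are hypothetical helper names introduced for illustration, not part of the PhantomJS API.

```javascript
// Hosts to block, taken from the list in the answer.
var blockedDomains = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com',
                      'facebook.net', 'facebook.com', 'planplus.rs'];

// Hypothetical helper: extract the lowercase host from an absolute URL.
// Returns '' for URLs without a "scheme://host" part (e.g. data: URLs).
function hostOf(url) {
    var m = /^[a-z][a-z0-9+.-]*:\/\/(?:[^\/?#@]*@)?([^\/?#:]+)/i.exec(url);
    return m ? m[1].toLowerCase() : '';
}

// Hypothetical helper: true if the URL's host equals a blocked domain
// or is a subdomain of one (e.g. connect.facebook.net).
function isBlocked(url, domains) {
    var host = hostOf(url);
    for (var i = 0; i < domains.length; i++) {
        var d = domains[i];
        if (host === d || host.slice(-(d.length + 1)) === '.' + d) {
            return true;
        }
    }
    return false;
}

// Only wire it up when actually running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceRequested = function (requestData, request) {
        if (isBlocked(requestData.url, blockedDomains)) {
            request.abort();
        }
    };
}
```

Whether the stricter match is worth it depends on the site; for a short, hand-picked block list the simple substring test is usually good enough.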
The load time for me is now just 4393ms, while the page is loaded and usable.
I don't think much more can be done without tinkering with the page's code because, judging by the page source, it is quite script-heavy.
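To put the measurements side by side, here is a small sketch that computes the relative reduction in load time at each step. The timings are the ones quoted in this thread (the 14 s figure is the asker's approximate number, so the first percentage is only approximate); reductionPercent is a hypothetical helper name.

```javascript
// Relative reduction in load time, in percent, rounded to one decimal place.
function reductionPercent(beforeMs, afterMs) {
    return Math.round((beforeMs - afterMs) / beforeMs * 1000) / 10;
}

var steps = [
    { label: 'asker\'s machine (~14 s) -> $8 VPS', before: 14000, after: 9589 },
    { label: 'VPS -> images off',                  before: 9589,  after: 8160 },
    { label: 'VPS -> images off + blocked hosts',  before: 9589,  after: 4393 }
];

steps.forEach(function (s) {
    console.log(s.label + ': ' + reductionPercent(s.before, s.after) + '% faster');
});
```

On these numbers, cancelling third-party requests accounts for most of the gain; turning off images alone saves comparatively little.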
The whole code:
var page = require('webpage').create();
var fs = require("fs");

// console.time polyfill from https://github.com/callmehiphop/console-time
;(function( console ) {
    var timers;
    if ( !console ) {
        return;
    }
    timers = {};
    console.time = function( name ) {
        if ( name ) {
            timers[ name ] = Date.now();
        }
    };
    console.timeEnd = function( name ) {
        if ( timers[ name ] ) {
            console.log( name + ': ' + (Date.now() - timers[ name ]) + 'ms' );
            delete timers[ name ];
        }
    };
}( window.console ));

console.time("open");

// Skip image downloads; only the DOM and scripts are needed.
page.settings.loadImages = false;
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36';
page.viewportSize = {
    width: 1280,
    height: 800
};

// Third-party hosts whose requests are aborted before they are sent.
var block_urls = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com', 'facebook.net', 'facebook.com', 'planplus.rs'];
page.onResourceRequested = function (requestData, request) {
    for (var i = 0; i < block_urls.length; i++) {
        if (requestData.url.indexOf(block_urls[i]) !== -1) {
            request.abort();
            console.log(requestData.url + " aborted");
            return;
        }
    }
};

page.open('https://www.halooglasi.com/nekretnine/izdavanje-stanova/novi-beograd---novi-merkator-id19270/5425485514649', function () {
    // Save the HTML as it is when the load callback fires, then stop the timer.
    fs.write("longload.html", page.content, 'w');
    console.timeEnd("open");
    // Give the page's scripts a few seconds before taking the screenshot.
    setTimeout(function () {
        page.render('longload.png');
        phantom.exit();
    }, 3000);
});
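The fixed 3000 ms delay before rendering is a guess about when the page's scripts have finished. One alternative is to poll for a condition and continue as soon as it holds. The sketch below is an illustration under assumptions, not part of the code above: waitFor is a hypothetical helper, and '.some-selector' is a placeholder for whatever element holds the data you are after.

```javascript
// Hypothetical helper: call onReady as soon as check() returns true,
// or onTimeout if that has not happened after timeoutMs milliseconds.
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    if (check()) {          // already satisfied, no need to poll
        onReady();
        return;
    }
    var timer = setInterval(function () {
        if (check()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// Only wire it up when actually running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.settings.loadImages = false;
    page.open('https://www.halooglasi.com/nekretnine/izdavanje-stanova/novi-beograd---novi-merkator-id19270/5425485514649', function () {
        waitFor(
            function () {
                // '.some-selector' is a placeholder, not taken from the real page.
                return page.evaluate(function () {
                    return document.querySelector('.some-selector') !== null;
                });
            },
            function () {
                console.log(page.content.length + ' characters of HTML ready');
                phantom.exit();
            },
            function () {
                console.log('Timed out waiting for the element');
                phantom.exit(1);
            },
            10000,  // give up after 10 s
            100     // poll every 100 ms
        );
    });
}
```

With this approach the script exits as soon as the data is in the DOM, so a fast page load is not padded with a full three-second wait.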