
PhantomJS and pjscrape - Failing on some of multiple URLs

Overview

I am trying to create a very basic scraper with PhantomJS and the pjscrape framework.

My Code

pjs.config({
    timeoutInterval: 6000,
    timeoutLimit: 10000,
    format: 'csv',
    csvFields: ['productTitle', 'price'],
    writer: 'file',
    outFile: 'D:\\prod_details.csv'
});

pjs.addSuite({
    title: 'ChainReactionCycles Scraper',
    url: productURLs, // an array of URLs; two example arrays are defined below
    scrapers: [
        function() {
            var results = [];
            // getText results are indexed with [0] below to take the first match
            var linkTitle = _pjs.getText('#ModelsDisplayStyle4_LblTitle');
            var linkPrice = _pjs.getText('#ModelsDisplayStyle4_LblMinPrice');
            // one row per page, in the same order as csvFields
            results.push([linkTitle[0], linkPrice[0]]);
            return results;
        }
    ]
});

URL Arrays Used

This first array DOES NOT WORK and fails after the 3rd or 4th URL.

var productURLs = ["8649","17374","7327","7325","14892","8650","8651","14893","18090","51318"];
for(var i=0;i<productURLs.length;++i){
  productURLs[i] = 'http://www.chainreactioncycles.com/Models.aspx?ModelID=' + productURLs[i];
}

This second array WORKS and does not fail, even though it is from the same site.

var categoriesURLs = ["304","2420","965","518","514","1667","521","1302","1138","510"];
for(var i=0;i<categoriesURLs.length;++i){
  categoriesURLs[i] = 'http://www.chainreactioncycles.com/Categories.aspx?CategoryID=' + categoriesURLs[i];
}
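
The two loops above differ only in the base URL and the list of IDs, so they could be folded into one helper. This is a minimal sketch, assuming a hypothetical function named buildUrls (it is not part of pjscrape or PhantomJS). It relies on Array.prototype.map from ES5, and the ID lists are shortened here for brevity:

```javascript
// Hypothetical helper (not part of pjscrape): prefix every ID with a base URL.
function buildUrls(base, ids) {
    return ids.map(function (id) {
        return base + id;
    });
}

// Same base URLs as in the question, with shortened ID lists.
var productURLs = buildUrls(
    'http://www.chainreactioncycles.com/Models.aspx?ModelID=',
    ['8649', '17374', '7327']
);
var categoriesURLs = buildUrls(
    'http://www.chainreactioncycles.com/Categories.aspx?CategoryID=',
    ['304', '2420', '965']
);
```

Either array can then be passed as the url value of pjs.addSuite, exactly as in the suite definition.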

Problem

When iterating through productURLs, the optional callback of PhantomJS's page.open reports a failure even though the page has not finished loading.

I know this because I ran the script alongside an HTTP debugger, and the HTTP requests were still in flight after PhantomJS had reported a page load failure.

However, the code works fine when running with categoriesURLs.
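
One way to narrow this down might be to take pjscrape out of the picture and open the same URLs one at a time with a bare PhantomJS page, recording the status string that page.open passes to its callback ('success' or 'fail'). This is a minimal sketch, assuming a hypothetical helper named probeUrls. Its page argument is anything with an open(url, callback) method, which inside PhantomJS would be the object returned by require('webpage').create():

```javascript
// Hypothetical diagnostic helper: open each URL in turn and record the
// status that page.open reports ('success' or 'fail' in PhantomJS).
function probeUrls(page, urls, done) {
    var report = [];

    function next(index) {
        if (index >= urls.length) {
            done(report);
            return;
        }
        page.open(urls[index], function (status) {
            report.push({ url: urls[index], status: status });
            next(index + 1); // move on only after this URL has reported
        });
    }

    next(0);
}

// Assumption: this part only runs under the phantomjs binary, where the
// global `phantom` object is defined. The URL is one from the question.
if (typeof phantom !== 'undefined') {
    var urls = ['http://www.chainreactioncycles.com/Models.aspx?ModelID=8649'];
    probeUrls(require('webpage').create(), urls, function (report) {
        report.forEach(function (entry) {
            console.log(entry.status + ' ' + entry.url);
        });
        phantom.exit();
    });
}
```

If the bare page reports 'fail' for the same URLs, that would point at PhantomJS rather than at the pjscrape configuration.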

Assumptions

  1. All the URLs listed above are VALID
  2. I have the latest versions of both PhantomJS and pjscrape

Possible Solutions

These are solutions I have tried thus far.

  1. Disabling image loading with page.options.loadImages = false
  2. Setting a larger timeoutInterval in pjs.config. This did not help, because the error generated was a page.open failure and NOT a timeout failure.

Any ideas?

Hzmy asked Nov 04 '22 04:11


1 Answer

The problem was caused by PhantomJS itself. It has been resolved now that I use PhantomJS v2.0.
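
To confirm which PhantomJS build a script is actually running under, the phantom.version object (with major, minor and patch fields) can be inspected at startup. This is a minimal sketch, assuming a hypothetical helper named isAtLeast:

```javascript
// Hypothetical helper: compare a {major, minor, patch} version object,
// shaped like phantom.version, against a required major.minor.
function isAtLeast(version, major, minor) {
    if (version.major !== major) {
        return version.major > major;
    }
    return version.minor >= minor;
}

// Assumption: this part only runs under the phantomjs binary, where the
// global `phantom` object is defined. It is meant to sit at the top of an
// existing script, so it does not call phantom.exit() itself.
if (typeof phantom !== 'undefined') {
    if (!isAtLeast(phantom.version, 2, 0)) {
        console.log('Running a PhantomJS build older than 2.0');
    }
}
```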

Hzmy answered Nov 09 '22 13:11
