
How to end a PhantomJS script only after client-side redirects have taken place

Tags:

phantomjs

I am working on integrating the PhantomJS headless browser into a project of mine (currently using version 1.6). For the most part, it does a great job of what I need. However, because WebPage.open() calls are asynchronous and phantom.exit() has to be called at some point, it is tricky to handle client-side redirects when you can't anticipate where they will go.

What I'm after is a way to call phantom.exit() only after any meta refreshes (that lead to a different page) and JavaScript redirects tied to things like onload events have been executed. I can see why this is hard: in theory a client-side redirect could take place any number of seconds after a page load, so I can't simply ask to exit only when no more redirects are going to take place. Right now, the best solution I can think of is to a) manually detect the presence of meta refresh elements on the page and deal with those myself, and b) use setInterval() to allow some sane amount of time (say, 1 to 1.5 seconds) to elapse before calling phantom.exit(). It would basically look like this:

var page = require('webpage').create();
var visitComplete = false;
var url = "http://some.url";
var pageOpenedTime;
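// Poll once a second; exit once the visit is complete and at least
// 1.5 seconds have passed since the page opened.
setInterval(function() {
    if (visitComplete && typeof pageOpenedTime != 'undefined' &&
        new Date() - pageOpenedTime >= 1500)
    {
        phantom.exit();
    }
}, 1000);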
page.open(url, function() {
    pageOpenedTime = new Date();
    if (!hasMetaRefresh(page)) {
        visitComplete = true;
    }
});

function hasMetaRefresh(page) {
    // Query the DOM here to detect meta refresh elements
}
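
The hasMetaRefresh() stub above could be filled in along these lines. This is only a sketch: parseMetaRefresh and its return shape are hypothetical names, not part of PhantomJS, and the parsing covers the common "N; url=..." form rather than every variant browsers accept. It assumes `page` is a PhantomJS WebPage and relies on page.evaluate(), which runs a function inside the page and returns a serializable result. A refresh without a URL only reloads the same page, so it is not counted here.

```javascript
// Parse the content attribute of <meta http-equiv="refresh">.
// Returns { delay: seconds, url: string or null }, or null if unparseable.
// (Hypothetical helper; handles the common "5; url=http://..." form only.)
function parseMetaRefresh(content) {
    var match = /^\s*(\d+(?:\.\d+)?)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*))?$/i.exec(content || '');
    if (!match) {
        return null;
    }
    // Drop optional surrounding quotes and whitespace from the URL part.
    var url = (match[2] || '').replace(/^\s*['"]?|['"]?\s*$/g, '');
    return { delay: parseFloat(match[1]), url: url || null };
}

// The function passed to evaluate() runs in the page context, so it cannot
// see parseMetaRefresh; it only collects the raw content strings.
function hasMetaRefresh(page) {
    var contents = page.evaluate(function () {
        var metas = document.getElementsByTagName('meta');
        var found = [];
        for (var i = 0; i < metas.length; i++) {
            if ((metas[i].httpEquiv || '').toLowerCase() === 'refresh') {
                found.push(metas[i].content);
            }
        }
        return found;
    }) || [];
    for (var i = 0; i < contents.length; i++) {
        var parsed = parseMetaRefresh(contents[i]);
        if (parsed && parsed.url) {
            return true; // a refresh that points at a URL
        }
    }
    return false;
}
```

The parsed delay could also be used to decide how long to wait before exiting, instead of a fixed 1.5 seconds.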

Any better ideas?

Edit: I should mention that my first thought was that there might be a PhantomJS event that fires once the JavaScript associated with the initial page load has been executed, but the onLoadFinished callback appears to precede the execution of any in-page JavaScript, including onload events. I also tested how long an interval I might need to wait: 1000 ms was long enough for a JavaScript redirect (via a body onload event) to execute in a small test page, but 100 ms was not.
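
One variation on the fixed delay is to wait for a quiet period instead: restart the clock on every load or navigation event, and exit only once nothing has happened for a while. Below is a minimal sketch of that idea. createIdleTracker is a hypothetical helper, not a PhantomJS API, and the 1500 ms quiet period and 250 ms polling interval are assumptions taken loosely from the numbers above. The wiring at the bottom uses the onNavigationRequested, onLoadStarted and onLoadFinished callbacks and only runs under PhantomJS.

```javascript
// Tracks the time of the most recent activity and reports whether a quiet
// period has elapsed since then. Timestamps (in milliseconds) are passed in
// so the logic can be exercised without a real clock.
function createIdleTracker(quietMs) {
    var lastActivity = null; // null until the first event is recorded
    return {
        touch: function (now) { lastActivity = now; },
        isIdle: function (now) {
            return lastActivity !== null && now - lastActivity >= quietMs;
        }
    };
}

// Wiring sketch; skipped unless the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var tracker = createIdleTracker(1500); // assumed quiet period
    var touch = function () { tracker.touch(Date.now()); };

    page.onNavigationRequested = touch; // a navigation (e.g. a redirect) was requested
    page.onLoadStarted = touch;
    page.onLoadFinished = touch;        // fires for each page in the chain

    touch();                            // start the clock even if nothing loads
    page.open('http://some.url');

    setInterval(function () {
        if (tracker.isIdle(Date.now())) {
            phantom.exit();
        }
    }, 250);
}
```

A meta refresh with a delay longer than the quiet period would still be missed, so this complements the meta refresh check rather than replacing it.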

Max Crowe asked Oct 03 '12 14:10


1 Answer

I've had the same issue loading a page that was using Optimizely, and the variation was a location.href redirect.

I now use the onNavigationRequested callback inside a "renderPage" function. Those Optimizely redirects no longer block and I don't need an arbitrary timeout.

var webpage = require('webpage');
var page = null;

var renderPage = function (myurl) {
    page = webpage.create();

    page.onNavigationRequested = function(url, type, willNavigate, main) {
        if (main && url!=myurl && url.replace(/\/$/,"")!=myurl && (type=="Other" || type=="Undefined") ) {
        // main = navigation in main frame; type = not by click/submit etc

            console.log("\tfollowing "+myurl+" redirect to "+url);
            myurl = url;
            page.close();
            renderPage(url); // rerun this function with the new URL
        }
    }; // on Nav req

    page.open(myurl, function(status) {
        if (status==="success") {
            page.render("screenshot.jpg");
        } else {
            page.close();
        }
    }); // page open
} // render page


renderPage("http://some.domain.com");

see docs: http://phantomjs.org/api/webpage/handler/on-navigation-requested.html
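
The redirect test inside the handler can be pulled into a small pure function, which makes the trailing-slash comparison easier to check in isolation. A sketch, assuming the hypothetical helper names stripTrailingSlash and isClientRedirect; unlike the code above, it strips the trailing slash from both URLs before comparing, which is a variation on the original check rather than the same one.

```javascript
function stripTrailingSlash(url) {
    return url.replace(/\/$/, '');
}

// Mirrors the condition in onNavigationRequested: a main-frame navigation to
// a different URL that was not triggered by a click, form submit, etc.
function isClientRedirect(currentUrl, requestedUrl, type, isMainFrame) {
    var sameUrl = stripTrailingSlash(currentUrl) === stripTrailingSlash(requestedUrl);
    var scripted = (type === 'Other' || type === 'Undefined');
    return Boolean(isMainFrame) && !sameUrl && scripted;
}

// Inside the handler this would replace the inline condition:
//   if (isClientRedirect(myurl, url, type, main)) { ...follow the redirect... }
```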

ProfessionalHack answered Nov 15 '22 07:11