
PhantomJS unexpected load behavior with multiple pages

I have a script (below) that scrapes a site in a three-step process. It works well when limited to one page at a time. However, when I raise the limit to two pages at a time, things go wrong: onLoadFinished fires earlier than I would expect, before the page has completely loaded, and the rest of my script breaks. Any idea why this might be happening? I should add that I'm using the newest version of PhantomJS (1.5).

MAX_PAGES = 1
### 
Changing MAX_PAGES to >1 causes the onLoadFinished event of some pages to fire
before the page is fully rendered. This is evident from the fact that there is
more than one image for some pages. I haven't been able to reproduce it using
microsoft.com, but on some pages I was working on, the first onLoadFinished
seemed to be called before the page was actually fully loaded, judging by the
look of the rendered images.
###

newPage = (id) ->
    context = {}
    context.id = id
    context.step = 0
    context.page = require('webpage').create()
    # count each navigation so every render gets its own file name
    context.page.onLoadStarted = ->
        context.step++
    context.page.onLoadFinished = (status) ->
        console.log status
        if status is 'success'
            context.page.render("#{context.id}_#{context.step}.png")
        else
            context.page.release()
    # start the load only after both handlers are attached
    context.page.open('http://www.microsoft.com')
    console.log 'started loading'

newPage id for id in [1..MAX_PAGES]
asked Apr 27 '12 15:04 by hackerhasid



1 Answer

I think the problem is that every webpage object within PhantomJS uses the same QNetworkAccessManager, so the finished() signal fires whenever any of the webpage objects finishes loading. Fixing this might require modifying PhantomJS's own code. I have noticed the same behavior before when trying to load multiple pages in parallel in PhantomJS. An application I'm working on uses QtWebKit and loads multiple pages simultaneously, so I have to make sure that each webpage gets its own QNetworkAccessManager; that way the finished() signals don't interfere with each other.
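If patching PhantomJS is not an option, one possible workaround is to keep only a single page loading at any moment, which is the configuration the question reports as working. The sketch below is not taken from the answer above; it is a minimal illustration in plain JavaScript, and the names `loadSequentially`, `createPage`, and `onDone` are hypothetical. It assumes `createPage` returns an object with the `open(url, callback)`, `render(filename)`, and `release()` methods of a PhantomJS 1.5 webpage object.

```javascript
// Sketch: load URLs strictly one after another so that only one page is
// ever in flight. `createPage` is a hypothetical factory; inside PhantomJS
// it would be: function () { return require('webpage').create(); }
function loadSequentially(urls, createPage, onDone) {
    var results = [];
    var index = 0;

    function next() {
        if (index >= urls.length) {
            onDone(results);
            return;
        }
        var id = index + 1;
        var url = urls[index];
        index += 1;

        var page = createPage();
        var handled = false; // guard in case the callback fires more than once

        page.open(url, function (status) {
            if (handled) {
                return;
            }
            handled = true;
            if (status === 'success') {
                page.render(id + '.png');
            }
            results.push({ id: id, url: url, status: status });
            page.release(); // free this page before the next load starts
            next();
        });
    }

    next();
}

// Possible use inside a PhantomJS script (assumed, not from the original post):
// loadSequentially(['http://www.microsoft.com'],
//     function () { return require('webpage').create(); },
//     function (results) { console.log(JSON.stringify(results)); phantom.exit(); });
```

This trades throughput for predictability: it does not address the shared network manager itself, it only avoids having two loads running at once.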

answered Sep 28 '22 14:09 by Cameron Tinker