 

Taking reliable screenshots of websites? PhantomJS and CasperJS both return empty screenshots on some websites


Open a web page and take a screenshot.

Using ONLY PhantomJS (this is a simple script; in fact, it is the example script used in their docs: http://phantomjs.org/screen-capture.html):

    var page = require('webpage').create();
    page.open('http://github.com/', function() {
      page.render('github.png');
      phantom.exit();
    });

The problem is that some websites (like GitHub), funnily enough, seem to detect PhantomJS and refuse to serve it, so nothing is rendered. The result is that github.png is a blank white PNG file.

Replace github with, say, "google.com" and you get a nice (proper) screenshot, as intended.

At first I thought this was a Phantomjs issue so I tried running it through Casperjs with:

    var casper = require('casper').create();

    casper.start('http://www.github.com/', function() {
        this.captureSelector('github.png', 'body');
    });

    casper.run();

But I get the same behavior as with PhantomJS.

So I figured, OK, this is most likely a user agent issue, as in: GitHub sniffs out PhantomJS and decides not to show the page. So I set the user agent as shown below, but that still didn't work.

    var page = require('webpage').create();
    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
    page.open('http://github.com/', function() {
      page.render('github.png');
      phantom.exit();
    });

So then I tried to parse the page and apparently some sites (again like github) don't appear to be sending anything down the wire.

Using casperjs I tried to print the title. And for google.com I got back Google but for github.com I got back bupkis. Example code:

    var casper = require('casper').create();

    casper.start('http://github.com/', function() {
        this.echo(this.getTitle());
    });

    casper.run();

The equivalent script in pure PhantomJS produces the same result.

Update:

Could this be a timing issue? Is GitHub just super slow? I doubt it, but let's test anyway.

    var page = require('webpage').create();
    page.open('http://github.com', function (status) {
        /* irrelevant */
        window.setTimeout(function () {
            page.render('github.png');
            phantom.exit();
        }, 3000);
    });

And the result is still bupkis. So no it's not a timing issue.

  1. How are some sites like github blocking phantomjs?
  2. How can we reliably take screenshots of ALL webpages? Required to be fast, and headless.
MrPizzaFace Avatar asked Oct 22 '14 21:10



1 Answer

After bouncing this around for some time, I was able to narrow down the problem. Apparently PhantomJS defaults to SSLv3 as its SSL protocol, which causes GitHub to refuse the connection with a failed SSL handshake. Running the script with debugging enabled:

phantomjs --debug=true github.js 

Shows output of:

    . . .
    2014-10-22T19:48:31 [DEBUG] WebPage - updateLoadingProgress: 10
    2014-10-22T19:48:32 [DEBUG] Network - Resource request error: 6 ( "SSL handshake failed" ) URL: "https://github.com/"
    2014-10-22T19:48:32 [DEBUG] WebPage - updateLoadingProgress: 100
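The handshake failure in the log above can also be surfaced from inside the script, instead of reading through the --debug output. A minimal sketch, assuming PhantomJS 1.9 or later (where the `page.onResourceError` callback is available); the helper name `attachDiagnostics` and the message format are made up for illustration:

```javascript
// Hypothetical helper: reports every failed resource request on a page object.
function attachDiagnostics(page, log) {
    page.onResourceError = function (resourceError) {
        log('Resource error ' + resourceError.errorCode +
            ' (' + resourceError.errorString + ') for ' + resourceError.url);
    };
    return page;
}

// This part only runs inside PhantomJS; elsewhere the helper is merely defined.
if (typeof phantom !== 'undefined') {
    var page = attachDiagnostics(require('webpage').create(), function (line) {
        console.log(line);
    });
    page.open('https://github.com/', function (status) {
        // status is 'success' or 'fail'; a failed load would render a blank image.
        if (status !== 'success') {
            console.log('Page failed to load, nothing to render');
            phantom.exit(1);
            return;
        }
        page.render('github.png');
        phantom.exit();
    });
}
```

Checking `status` before rendering is what separates "the page never loaded" from "the page loaded but rendered badly", which is the distinction the rest of this answer depends on.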

So from this we can conclude that no screenshot was taken because GitHub was refusing the connection. Great, that makes perfect sense. So let's set the SSL flag to --ssl-protocol=any, and let's also ignore SSL errors with --ignore-ssl-errors=true:

phantomjs --ignore-ssl-errors=true --ssl-protocol=any --debug=true github.js 

Great success! A screenshot is now being rendered and saved properly, but the debug output is showing us a TypeError:

    TypeError: 'undefined' is not a function (evaluating 'Array.prototype.forEach.call.bind(Array.prototype.forEach)')
      https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
      https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
    2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 72
    2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 88
    ReferenceError: Can't find variable: $
      https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1
      https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1

I checked the github homepage manually just to see if a TypeError existed and it does NOT.

My next guess is that the assets aren't loading quickly enough. PhantomJS is faster than a speeding bullet!

So let's try to slow it down artificially and see if we can get rid of that TypeError...

    var page = require('webpage').create();
    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
    page.open('http://github.com', function (status) {
        window.setTimeout(function () {
            page.render('github.png');
            phantom.exit();
        }, 3000);
    });

That didn't work... After a closer inspection of the image, it is clear that some elements are missing, mainly some icons and the logo.

Success? Partially, because we are now at least getting a screenshot where earlier we weren't getting a thing.

Job done? Not exactly. I need to determine what is causing that TypeError, because it is preventing some assets from loading and distorting the image.
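One way to see which page scripts are failing, without the rest of the --debug noise, is the `page.onError` callback, which receives the message and a stack trace for errors thrown inside the page. A sketch only: `formatPageError` is a hypothetical helper name, and the trace entries are assumed to carry `file`, `line` and optionally `function`, as they do in PhantomJS 1.9:

```javascript
// Hypothetical helper: turns a page error and its trace into one printable string.
function formatPageError(msg, trace) {
    var out = ['PAGE ERROR: ' + msg];
    (trace || []).forEach(function (t) {
        out.push('  -> ' + t.file + ':' + t.line +
                 (t.function ? ' (in function "' + t.function + '")' : ''));
    });
    return out.join('\n');
}

// This part only runs inside PhantomJS; elsewhere the helper is merely defined.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onError = function (msg, trace) {
        console.log(formatPageError(msg, trace));
    };
    page.open('https://github.com/', function () {
        window.setTimeout(function () {
            page.render('github.png');
            phantom.exit();
        }, 3000);
    });
}
```

With this in place, each failing script shows up as a short block naming the file and line, which makes it easier to tell whether a change actually removed the TypeError.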

Additional

I attempted to recreate this with CasperJS. Its --debug output is very ugly and hard to follow compared to PhantomJS:

    var casper = require('casper').create();

    casper.start();
    casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');
    casper.thenOpen('https://www.github.com/', function() {
        this.captureSelector('github.png', 'body');
    });

    casper.run();

console:

casperjs test --ssl-protocol=any --debug=true github.js 

Furthermore, the image is missing the same icons but is also visually distorted. Given that CasperJS relies on PhantomJS, I do not see the value in using it for this specific task.

If you would like to add to my answer, please share your findings. I am very interested in a flawless PhantomJS solution.

Update #1 : Removing the TypeError

@ArtjomB points out that PhantomJS does not support JavaScript's bind in its current version as of this update (1.9.7). As he explains in his answer on the PhantomJS bind issue:

The TypeError: 'undefined' is not a function refers to bind, because PhantomJS 1.x doesn't support it. PhantomJS 1.x uses an old fork of QtWebkit which is comparable to Chrome 13 or Safari 5. The newer PhantomJS 2 will use a newer engine which will support bind. For now you need to add a shim inside of the page.onInitialized event handler:

OK great, so the following code will take care of our TypeError from above. (But it is not fully functional; see below for details.)

    var page = require('webpage').create();
    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';

    // Register the shim before opening the page so it runs ahead of the page's own scripts.
    page.onInitialized = function(){
        page.evaluate(function(){
            var isFunction = function(o) {
              return typeof o == 'function';
            };

            var bind,
              slice = [].slice,
              proto = Function.prototype,
              featureMap;

            featureMap = {
              'function-bind': 'bind'
            };

            function has(feature) {
              var prop = featureMap[feature];
              return isFunction(proto[prop]);
            }

            // check for missing features
            if (!has('function-bind')) {
              // adapted from Mozilla Developer Network example at
              // https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/bind
              bind = function bind(obj) {
                var args = slice.call(arguments, 1),
                  self = this,
                  nop = function() {
                  },
                  bound = function() {
                    return self.apply(this instanceof nop ? this : (obj || {}), args.concat(slice.call(arguments)));
                  };
                nop.prototype = this.prototype || {}; // Firefox cries sometimes if prototype is undefined
                bound.prototype = new nop();
                return bound;
              };
              proto.bind = bind;
            }
        });
    };

    page.open('http://github.com', function (status) {
        window.setTimeout(function () {
            page.render('github.png');
            phantom.exit();
        }, 5000);
    });

Now the above code gets us the same screenshot as before, AND the debug output no longer shows a TypeError, so on the surface everything appears to work. Progress has been made.

Unfortunately, all of the image icons [logo, etc.] are still not loading correctly. We see some sort of 3W icon; I'm not sure where that's from.
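For context, the expression named in the TypeError builds a standalone `forEach(list, callback)` helper by binding `Function.prototype.call` to `Array.prototype.forEach`. On an engine without `Function.prototype.bind`, the `.bind` lookup yields `undefined` and calling it throws, which would plausibly also explain the follow-on ReferenceError for `$`, since the script stops at that point. A small sketch of what the expression does once bind exists (variable names are illustrative):

```javascript
// With bind available, this yields a function usable as forEach(list, callback).
var forEach = Array.prototype.forEach.call.bind(Array.prototype.forEach);

// It also works on array-likes, which is the usual reason for this idiom.
var arrayLike = { 0: 'logo', 1: 'icon', length: 2 };
var seen = [];
forEach(arrayLike, function (item, index) {
    seen.push(index + ':' + item);
});

console.log(seen.join(', ')); // prints "0:logo, 1:icon"
```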

Thanks for the help @ArtjomB


MrPizzaFace Avatar answered Nov 13 '22 05:11
