
Save a PhantomJS-processed page to an HTML file with absolute URLs

I want to save certain web pages, after the document has loaded, to a file name of my choosing, with all URLs and links converted to absolute URLs, the way wget -k does.

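// phantomjs
var page = require('webpage').create();
var url = 'http://google.com/';
page.open(url, function (status) {
    // evaluate() can only return simple values, so return the markup
    // string rather than the document object itself
    var html = page.evaluate(function () {
        return document.documentElement.outerHTML;
    });
    console.log(html);
    phantom.exit();
});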

For example, my HTML content is something like this:

<a href="/page.html">page</a>

and it must become:

<a href="http://google.com/page.html">page</a>

This is my sample script, but how can I convert all URLs and links, as wget -k does, using PhantomJS?

sweb asked Jan 28 '13 00:01


2 Answers

You can modify the final HTML so that it has a <base> tag, which makes all relative URLs resolve correctly. In your case, try putting <base href="http://google.com/"> right after the opening <head> tag of the page.
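As a rough sketch of the <base> approach, assuming a plain string replacement on the serialized markup is acceptable: `insertBaseTag` is a hypothetical helper name, not part of PhantomJS, and it does not escape quotes in the URL. The PhantomJS part is guarded so that the helper can also be tried outside PhantomJS.

```javascript
// Hypothetical helper (not a PhantomJS API): insert a <base> tag
// right after the opening <head> tag of an HTML string.
function insertBaseTag(html, baseUrl) {
    var baseTag = '<base href="' + baseUrl + '">';
    // Matches <head> with or without attributes, but not <header>.
    var headOpen = /<head(\s[^>]*)?>/i;
    if (!headOpen.test(html)) {
        // No <head> found: return the markup unchanged.
        return html;
    }
    // A replacer function avoids special handling of "$" in baseUrl.
    return html.replace(headOpen, function (match) {
        return match + baseTag;
    });
}

// This part only runs inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://google.com/';
    page.open(url, function (status) {
        if (status === 'success') {
            console.log(insertBaseTag(page.content, url));
        }
        phantom.exit();
    });
}
```

Note that this leaves the href and src attributes themselves untouched; the browser resolves them against the <base> when the saved file is opened.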

Volodymyr Yamkovyy answered Nov 06 '22 22:11


This is not really supported, because PhantomJS is more than just an HTTP client. Imagine JavaScript code that pulls in random content, with an image, on the main landing page.

The workaround, which might or might not work for you, is to replace all the referenced resources in the DOM. This is possible using CSS3 selectors (href for a, src for img, etc.) and manually resolving each path relative to the base URL. If you really need to track and list every single resource URL, use the network traffic monitoring feature.
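A minimal sketch of the DOM workaround above, with one substitution stated plainly: instead of resolving paths by hand, it relies on the DOM href and src properties, which hold the already-resolved URL while the attribute holds the raw one. `absolutizeUrls` is a hypothetical name, the selector list is an assumption and far from exhaustive (CSS url() references, srcset and similar are not covered), and the optional `doc` parameter is there only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: rewrite href/src attributes to absolute URLs.
// Meant to run in the page context via page.evaluate(); `doc` falls
// back to the page's own document when no argument is passed.
function absolutizeUrls(doc) {
    doc = doc || document;
    var targets = [
        { selector: 'a[href], link[href]', attr: 'href' },
        { selector: 'img[src], script[src], iframe[src]', attr: 'src' }
    ];
    var changed = 0;
    for (var t = 0; t < targets.length; t++) {
        var attr = targets[t].attr;
        var nodes = doc.querySelectorAll(targets[t].selector);
        for (var i = 0; i < nodes.length; i++) {
            // Copy the resolved property value back into the attribute.
            nodes[i].setAttribute(attr, nodes[i][attr]);
            changed++;
        }
    }
    return changed;
}

// This part only runs inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://google.com/', function (status) {
        if (status === 'success') {
            page.evaluate(absolutizeUrls);
            // page.content now reflects the rewritten attributes.
            console.log(page.content);
        }
        phantom.exit();
    });
}
```

Resources that scripts request later on will not show up in these attributes, which is where the network monitoring callbacks come in.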

Last but not least, to get the generated content you can use page.content instead of that complicated dance with evaluate and outerHTML.
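To tie page.content back to the original goal of saving to a file, here is a small sketch that assumes the PhantomJS fs module and its write(path, content, mode) call. `savePage` and the output name google.html are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: write the rendered markup of `page` to `path`.
// `fs` is passed in so the helper does not depend on a global module.
function savePage(page, fs, path) {
    // page.content is the serialized DOM after the page's scripts ran.
    fs.write(path, page.content, 'w');
    return path;
}

// This part only runs inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var fs = require('fs');
    page.open('http://google.com/', function (status) {
        if (status === 'success') {
            savePage(page, fs, 'google.html');
        }
        phantom.exit();
    });
}
```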

Ariya Hidayat answered Nov 06 '22 22:11