
Save HTML output of a page after execution of the page's JavaScript

There is a site I am trying to scrape that first loads HTML/JS, modifies the form input fields using JS, and then POSTs. How can I get the final HTML output of the POSTed page?

I tried to do this with PhantomJS, but it seems to only have an option to render image files. Googling around suggests it should be possible, but I can't figure out how. My attempt:

var page = require('webpage').create();
var fs = require('fs');
page.open('https://www.somesite.com/page.aspx', function () {
    page.evaluate(function(){

    });

    page.render('export.png');
    fs.write('1.html', page.content, 'w');
    phantom.exit();
});

This code will be used by a client, and I can't expect him to install too many packages (Node.js, CasperJS, etc.).

Thanks

gyaani_guy asked May 31 '13 11:05




2 Answers

The output code you have is correct, but there is a timing issue: the output lines are executed before the page has finished loading. You can hook into the onLoadFinished callback to find out when that happens. See the full code below.

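    var page = require('webpage').create();
    var fs = require('fs');

    // Runs when a page load has finished, so page.content is populated.
    page.onLoadFinished = function() {
      console.log("page load finished");
      page.render('export.png');              // screenshot of the rendered page
      fs.write('1.html', page.content, 'w');  // current HTML of the page
      phantom.exit();
    };

    // Start the load; the output is written by onLoadFinished above.
    page.open("http://www.google.com", function() {
      page.evaluate(function() {
      });
    });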

Testing against a site like Google can be deceptive: it loads so quickly that you can often get away with taking the screen grab inline, as in your code. Timing is tricky in PhantomJS; I sometimes test with setTimeout to see whether timing is the issue.
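For the case in the question, where the page's script fills in the form and then POSTs, onLoadFinished would likely fire more than once: once for the form page and again for the POST response. The following is only a rough sketch of one way to wait for the later load. It assumes exactly two successful loads, and both the helper name makeLoadCounter and the 500 ms settle delay are made up for illustration; a real site may need different values.

```javascript
// Hypothetical helper: returns a handler that calls onFinal only on the
// Nth successful load, ignoring failed loads and any later ones.
function makeLoadCounter(expectedLoads, onFinal) {
  var seen = 0;
  return function (status) {
    if (status !== 'success') { return; }
    seen += 1;
    if (seen === expectedLoads) { onFinal(); }
  };
}

// PhantomJS wiring; skipped when the file is run outside PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var fs = require('fs');

  // Assumption: load 1 is the form page, load 2 is the POST response.
  page.onLoadFinished = makeLoadCounter(2, function () {
    // Assumption: 500 ms is an arbitrary settle delay for late scripts.
    setTimeout(function () {
      fs.write('1.html', page.content, 'w');
      phantom.exit();
    }, 500);
  });

  page.open('https://www.somesite.com/page.aspx');
}
```

If the second load never happens, this script never exits, so an overall timeout that calls phantom.exit() may be worth adding when trying it out.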

uffa answered Oct 19 '22 01:10


When I copied your code directly, and changed the URL to www.google.com, it worked fine, with two files saved:

  • 1.html
  • export.png

Bear in mind that the files will be written to the directory you run the script from, not the directory where your .js file is located.
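If the output should land next to the script instead, one option is to build the path from phantom.libraryPath, which PhantomJS initially sets to the directory of the invoked script. This is a minimal sketch under that assumption; joinPath is a hypothetical helper, not part of PhantomJS.

```javascript
// Hypothetical helper: joins a directory and a file name with one separator.
function joinPath(dir, separator, name) {
  var endsWithSep = dir.slice(-separator.length) === separator;
  return endsWithSep ? dir + name : dir + separator + name;
}

// PhantomJS wiring; skipped when the file is run outside PhantomJS.
if (typeof phantom !== 'undefined') {
  var fs = require('fs');
  var page = require('webpage').create();

  page.open('http://www.google.com', function () {
    // phantom.libraryPath starts out as the invoked script's directory;
    // fs.workingDirectory would give the directory the script was run from.
    var target = joinPath(phantom.libraryPath, fs.separator, '1.html');
    fs.write(target, page.content, 'w');
    phantom.exit();
  });
}
```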

Owen Martin answered Oct 19 '22 00:10