 

Applying DOM Manipulations to HTML and saving the result?

I have about 100 static HTML pages that all follow the same HTML structure. I want to apply some DOM manipulations to each of these files and then save the resulting HTML.

These are the manipulations I want to apply:

// [start]
$("h1.title, h2.description", this).wrap("<hgroup>");
if ( $("h1.title").height() < 200 ) {
  $("div.content").addClass('tall');
}
// [end]
// SAVE NEW HTML

The first line (.wrap()) I could easily do with a find and replace, but it gets tricky when I have to determine the calculated height of an element, which can't easily be determined without JavaScript.

Does anyone know how I can achieve this? Thanks!

asked Jul 28 '11 20:07 by gabriel


2 Answers

While the first part could indeed be solved in "text mode" using regular expressions or a more complete DOM implementation in JavaScript, for the second part (the height calculation), you'll need a real, full browser or a headless engine like PhantomJS.

From the PhantomJS homepage:

PhantomJS is a command-line tool that packs and embeds WebKit. Literally it acts like any other WebKit-based web browser, except that nothing gets displayed to the screen (thus, the term headless). In addition to that, PhantomJS can be controlled or scripted using its JavaScript API.


A schematic instruction (which I admit is not tested) follows.

In your modification script (say, modify-html-file.js), open an HTML page, modify its DOM tree, and console.log the HTML of the root element:

var page = new WebPage();

page.open(encodeURI('file://' + phantom.args[0]), function (status) {
    if (status === 'success') {
        var html = page.evaluate(function () {
            // your DOM manipulation here
            return document.documentElement.outerHTML;
        });
        console.log(html);
    }
    phantom.exit();
});

Next, save the new HTML by redirecting your script's output to a file:

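#!/bin/bash

# Create the output directory (no error if it already exists)
mkdir -p modified
for i in *.html; do
    # Use the loop variable, not the script's first argument
    phantomjs modify-html-file.js "$i" > modified/"$i"
done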

answered Oct 20 '22 12:10 by katspaugh


I tried PhantomJS as in katspaugh's answer, but ran into several issues trying to manipulate pages. My use case was modifying the static HTML output of Doxygen without modifying Doxygen itself. The goal was to reduce the delivered file size by removing unnecessary elements from the page, and to convert it to HTML5. I also wanted to use jQuery to access and modify elements more easily.

Loading the page in PhantomJS

The APIs appear to have changed drastically since the accepted answer. Additionally, I used a different approach (derived from this answer), which will be important in mitigating one of the major issues I encountered.

var system = require('system');
var fs = require('fs');
var page = require('webpage').create();

// Reading the page's content into your "webpage"
// This automatically refreshes the page
page.content = fs.read(system.args[1]);

// Make all your changes here

fs.write(system.args[2], page.content, 'w');
phantom.exit();

Preventing JavaScript from Running

My page uses Google Analytics in the footer, and after loading it in PhantomJS the page was modified beyond what I intended, presumably because its JavaScript was run. Disabling JavaScript entirely isn't an option, because then we can't use jQuery to modify the page. I tried temporarily changing the script tag, but when I did, every special character was replaced with an HTML-escaped equivalent, destroying all JavaScript code on the page. Then I came across this answer, which gave me the following idea: rewrite the type attribute of each script tag in the raw file before loading it, so the engine does not execute it, and restore it afterwards.

var rawPageString = fs.read(system.args[1]);
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");

page.content = rawPageString;

// Make all your changes here

rawPageString = page.content;
// The serialized page may use different attribute quotes, so accept either style
rawPageString = rawPageString.replace(/<script type=["']foo\/bar["']/g, "<script");
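
The masking idea above can be factored into two small helpers. This is only a sketch and not part of the original answer: the names maskScripts and unmaskScripts are hypothetical, and it assumes that script tags may carry further attributes after type and that the serialized page may come back with either quote style. It is plain JavaScript with no PhantomJS dependency, so it can be tried outside PhantomJS as well.

```javascript
// Hypothetical helpers (not part of PhantomJS): mask and unmask <script> tags
// so that a browser engine will not execute them while the page is loaded.
var MASK_TYPE = 'foo/bar';

function maskScripts(html) {
    return html
        // <script type="text/javascript" ...> in either quote style
        .replace(/<script(\s+)type=(["'])text\/javascript\2/gi,
                 '<script$1type="' + MASK_TYPE + '"')
        // bare <script> with no attributes
        .replace(/<script>/gi, '<script type="' + MASK_TYPE + '">');
}

function unmaskScripts(html) {
    // Drop the placeholder type again; any remaining attributes are kept.
    return html.replace(/<script(\s+)type=(["'])foo\/bar\2/gi, '<script');
}
```

As in the snippet above, tags that have no type attribute but do have other attributes (for example only a src) are not masked by this sketch, and the original type="text/javascript" is not put back on restore.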

Adding jQuery

There's actually an example on how to use jQuery. However, I thought an offline copy would be more appropriate. Initially I tried using page.includeJs as in the example, but found that page.injectJs was more suitable for the use case. Unlike includeJs, there's no <script> tag added to the page context, and the call blocks execution which simplifies the code. jQuery was placed in the same directory I was executing my script from.

page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {

  // Make all changes here

  // Remove the foo/bar type more easily here
  $("script[type^=foo]").removeAttr("type");
});

fs.write(system.args[2], page.content, 'w');
phantom.exit();

Putting it All Together

var system = require('system');
var fs = require('fs');
var page = require('webpage').create();

var rawPageString = fs.read(system.args[1]);
// Prevent in-page javascript execution
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");

page.content = rawPageString;

page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {

  // Make all changes here

  // Remove the foo/bar type
  $("script[type^=foo]").removeAttr("type");
});

fs.write(system.args[2], page.content, 'w');
phantom.exit();

Using it from the command line:

phantomjs modify-html-file.js "input_file.html" "output_file.html"

Note: This was tested and working with PhantomJS 2.0.0 on Windows 8.1.

Pro tip: If speed matters, you should consider iterating the files from within your PhantomJS script rather than a shell script. This will avoid the latency that PhantomJS has when starting up.
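
A rough sketch of that in-script iteration follows. It is untested against a real PhantomJS installation and rests on assumptions: the script takes an input directory and an output directory as its two arguments, planJobs is a hypothetical helper name, and page.content is assumed to be usable synchronously, as in the script above. It uses fs.list, fs.makeTree, and fs.separator from the PhantomJS fs module.

```javascript
// Hypothetical batch driver: process every .html file in one PhantomJS run.

// Pure helper: pair each .html file name with its input and output path.
function planJobs(names, inDir, outDir, sep) {
    return names
        .filter(function (name) { return /\.html$/i.test(name); })
        .map(function (name) {
            return { input: inDir + sep + name, output: outDir + sep + name };
        });
}

// The PhantomJS-specific part only runs when the `phantom` global exists.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var fs = require('fs');
    var page = require('webpage').create();

    var inDir = system.args[1];
    var outDir = system.args[2];
    fs.makeTree(outDir); // create the output directory if it is missing

    planJobs(fs.list(inDir), inDir, outDir, fs.separator).forEach(function (job) {
        page.content = fs.read(job.input);
        page.evaluate(function () {
            // Make all changes here
        });
        fs.write(job.output, page.content, 'w');
    });

    phantom.exit();
}
```

Saved under a hypothetical name such as modify-all.js, it would be invoked as: phantomjs modify-all.js "input_dir" "output_dir"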

answered Oct 20 '22 12:10 by Adam Heinermann