
Load HTML string into DOM tree with JavaScript

I'm currently working with an automation framework that pulls down a webpage for analysis and presents it as a string for processing. The Rhino JavaScript engine is available to assist in parsing the returned web page.

It seems that if the string (which is a complete webpage) can be loaded in a DOM representation, it would provide a very nice interface for parsing and analyzing content.

Using only Javascript, is this a possible and/or feasible concept?

Edit:

I'll decompose the question for clarity: say I have a string in JavaScript that contains HTML, like so:


var $mywebpage = '<!DOCTYPE HTML PUB ...//snipped//... </body></html>';

Is it possible/realistic to load it somehow into a DOM object?

xelco52 asked Feb 04 '11 22:02



2 Answers

I'm accepting JonDavidJohn's answer as it was useful in solving my problem, though I'm including this additional answer for others who may view this in the future.

It appears that while browsers let JavaScript load HTML strings into a DOM element, the DOM is not part of core ECMAScript, and as such is not available to scripts running under Rhino.

As a side note, a good alternative that was implemented in Rhino 1.6 is E4X. While not a DOM implementation, it provides conceptually similar capabilities.

xelco52 answered Oct 23 '22 06:10



If the document is XHTML, you can parse it with any XML parser. E4X would probably do the job nicely, as would the built-in Java XML parsing interfaces.
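
To illustrate the built-in Java route, here is a minimal sketch using the JDK's standard `javax.xml.parsers` API. It only works when the input is well-formed XML. The class name `XhtmlDomSketch`, the `parse` helper, and the sample page are my own placeholders, not something from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
public class XhtmlDomSketch {

    // Parse a well-formed XHTML string into a W3C DOM Document.
    // Tag-soup HTML will be rejected here; it needs a lenient HTML parser instead.
    static Document parse(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; the DOCTYPE is deliberately left out.
        String page = "<html><head><title>Demo</title></head>"
                    + "<body><p>first</p><p>second</p></body></html>";
        Document doc = parse(page);

        // Walk all <p> elements and print their text content.
        NodeList paragraphs = doc.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            System.out.println(paragraphs.item(i).getTextContent());
        }
    }
}
```

Since Rhino scripts can call Java classes directly, the same calls should be reachable from the script side as well. One caveat: a real page that carries a DOCTYPE may cause the parser to try to fetch the referenced DTD, so treat this as a starting point rather than a complete solution.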

The env.js library is designed to emulate the browser environment under Rhino, but I believe your document also needs to be compliant XHTML:

http://ejohn.org/blog/bringing-the-browser-to-the-server/

http://www.envjs.com/

If it's HTML, however, it's more difficult, as browsers are designed to be extremely lenient in how markup is parsed. See here for a list of HTML parsers in Java:

http://java-source.net/open-source/html-parsers

This is not an easy problem to solve. People have gone so far as to embed the Mozilla Gecko engine in Java via JNI in order to use its parsing capabilities.

I would recommend you look into the following pure-Java project:

http://lobobrowser.org/cobra.jsp

The goal of the Lobo project is to develop a pure-Java web browser. It's a pretty interesting project, and there's a lot there, but I believe you could use the parser standalone quite easily in your own application, as described in the following link:

http://lobobrowser.org/cobra/java-html-parser.jsp

jbeard4 answered Oct 23 '22 07:10
