Here's something I've been having a little bit of difficulty with. I have a local client-side script that needs to allow a user to fetch a remote web page and search that resulting page for forms. In order to do this (without regex), I need to parse the document into a fully traversable DOM object.
Some limitations I'd like to stress:
getElementsByTagName
, need to be available.Assuming I have a complete HTML document string (including DOCTYPE declaration) in the variable html
, here's what I've tried so far:
var frag = document.createDocumentFragment(), div = frag.appendChild(document.createElement("div")); div.outerHTML = html; //-> results in an empty fragment div.insertAdjacentHTML("afterEnd", html); //-> HTML is not added to the fragment div.innerHTML = html; //-> Error (expected, but I tried it anyway) var doc = new ActiveXObject("htmlfile"); doc.write(html); doc.close(); //-> JavaScript executes
I've also tried extracting the <head>
and <body>
nodes from the HTML and adding them to a <HTML>
element inside the fragment, still no luck.
Does anyone have any ideas?
First, select the <ul> element by its id using the querySelector() method. Second, create a new document fragment. Third, for each element in the languages array, create a list item element, assign the list item's innerHTML to the language , and append all the newly created list items to the document fragment.
The DocumentFragment interface represents a minimal document object that has no parent. It is used as a lightweight version of Document that stores a segment of a document structure comprised of nodes just like a standard document.
To create a new HTML Fragment:Paste your HTML code into the large HTML Code box. Choose between the 'Manual', 'Head' and 'Body' types. Click the Add HTML Fragment button.
DocumentFragment s are DOM Node objects which are never part of the main DOM tree. The usual use case is to create the document fragment, append elements to the document fragment and then append the document fragment to the DOM tree. In the DOM tree, the document fragment is replaced by all its children.
Fiddle: http://jsfiddle.net/JFSKe/6/
DocumentFragment
doesn't implement DOM methods. Using document.createElement
in conjunction with innerHTML
removes the <head>
and <body>
tags (even when the created element is a root element, <html>
). Therefore, the solution should be sought elsewhere. I have created a cross-browser string-to-DOM function, which makes use of an invisible inline-frame.
All external resources and scripts will be disabled. See Explanation of the code for more information.
/* @param String html The string with HTML which has be converted to a DOM object @param func callback (optional) Callback(HTMLDocument doc, function destroy) @returns undefined if callback exists, else: Object HTMLDocument doc DOM fetched from Parameter:html function destroy Removes HTMLDocument doc. */ function string2dom(html, callback){ /* Sanitise the string */ html = sanitiseHTML(html); /*Defined at the bottom of the answer*/ /* Create an IFrame */ var iframe = document.createElement("iframe"); iframe.style.display = "none"; document.body.appendChild(iframe); var doc = iframe.contentDocument || iframe.contentWindow.document; doc.open(); doc.write(html); doc.close(); function destroy(){ iframe.parentNode.removeChild(iframe); } if(callback) callback(doc, destroy); else return {"doc": doc, "destroy": destroy}; } /* @name sanitiseHTML @param String html A string representing HTML code @return String A new string, fully stripped of external resources. All "external" attributes (href, src) are prefixed by data- */ function sanitiseHTML(html){ /* Adds a <!-\"'--> before every matched tag, so that unterminated quotes aren't preventing the browser from splitting a tag. Test case: '<input style="foo;b:url(0);><input onclick="<input type=button onclick="too() href=;>">' */ var prefix = "<!--\"'-->"; /*Attributes should not be prefixed by these characters. This list is not complete, but will be sufficient for this function. (see http://www.w3.org/TR/REC-xml/#NT-NameChar) */ var att = "[^-a-z0-9:._]"; var tag = "<[a-z]"; var any = "(?:[^<>\"']*(?:\"[^\"]*\"|'[^']*'))*?[^<>]*"; var etag = "(?:>|(?=<))"; /* @name ae @description Converts a given string in a sequence of the original input and the HTML entity @param String string String to convert */ var entityEnd = "(?:;|(?!\\d))"; var ents = {" ":"(?:\\s| ?|�*32"+entityEnd+"|�*20"+entityEnd+")", "(":"(?:\\(|�*40"+entityEnd+"|�*28"+entityEnd+")", ")":"(?:\\)|�*41"+entityEnd+"|�*29"+entityEnd+")", ".":"(?:\\.|�*46"+entityEnd+"|�*2e"+entityEnd+")"}; /*Placeholder to avoid tricky filter-circumventing methods*/ var charMap = {}; var s = ents[" "]+"*"; /* Short-hand space */ /* Important: Must be pre- and postfixed by < and >. RE matches a whole tag! */ function ae(string){ var all_chars_lowercase = string.toLowerCase(); if(ents[string]) return ents[string]; var all_chars_uppercase = string.toUpperCase(); var RE_res = ""; for(var i=0; i<string.length; i++){ var char_lowercase = all_chars_lowercase.charAt(i); if(charMap[char_lowercase]){ RE_res += charMap[char_lowercase]; continue; } var char_uppercase = all_chars_uppercase.charAt(i); var RE_sub = [char_lowercase]; RE_sub.push("�*" + char_lowercase.charCodeAt(0) + entityEnd); RE_sub.push("�*" + char_lowercase.charCodeAt(0).toString(16) + entityEnd); if(char_lowercase != char_uppercase){ RE_sub.push("�*" + char_uppercase.charCodeAt(0) + entityEnd); RE_sub.push("�*" + char_uppercase.charCodeAt(0).toString(16) + entityEnd); } RE_sub = "(?:" + RE_sub.join("|") + ")"; RE_res += (charMap[char_lowercase] = RE_sub); } return(ents[string] = RE_res); } /* @name by @description second argument for the replace function. */ function by(match, group1, group2){ /* Adds a data-prefix before every external pointer */ return group1 + "data-" + group2 } /* @name cr @description Selects a HTML element and performs a search-and-replace on attributes @param String selector HTML substring to match @param String attribute RegExp-escaped; HTML element attribute to match @param String marker Optional RegExp-escaped; marks the prefix @param String delimiter Optional RegExp escaped; non-quote delimiters @param String end Optional RegExp-escaped; forces the match to end before an occurence of <end> when quotes are missing */ function cr(selector, attribute, marker, delimiter, end){ if(typeof selector == "string") selector = new RegExp(selector, "gi"); marker = typeof marker == "string" ? marker : "\\s*="; delimiter = typeof delimiter == "string" ? delimiter : ""; end = typeof end == "string" ? end : ""; var is_end = end && "?"; var re1 = new RegExp("("+att+")("+attribute+marker+"(?:\\s*\"[^\""+delimiter+"]*\"|\\s*'[^'"+delimiter+"]*'|[^\\s"+delimiter+"]+"+is_end+")"+end+")", "gi"); html = html.replace(selector, function(match){ return prefix + match.replace(re1, by); }); } /* @name cri @description Selects an attribute of a HTML element, and performs a search-and-replace on certain values @param String selector HTML element to match @param String attribute RegExp-escaped; HTML element attribute to match @param String front RegExp-escaped; attribute value, prefix to match @param String flags Optional RegExp flags, default "gi" @param String delimiter Optional RegExp-escaped; non-quote delimiters @param String end Optional RegExp-escaped; forces the match to end before an occurence of <end> when quotes are missing */ function cri(selector, attribute, front, flags, delimiter, end){ if(typeof selector == "string") selector = new RegExp(selector, "gi"); flags = typeof flags == "string" ? flags : "gi"; var re1 = new RegExp("("+att+attribute+"\\s*=)((?:\\s*\"[^\"]*\"|\\s*'[^']*'|[^\\s>]+))", "gi"); end = typeof end == "string" ? end + ")" : ")"; var at1 = new RegExp('(")('+front+'[^"]+")', flags); var at2 = new RegExp("(')("+front+"[^']+')", flags); var at3 = new RegExp("()("+front+'(?:"[^"]+"|\'[^\']+\'|(?:(?!'+delimiter+').)+)'+end, flags); var handleAttr = function(match, g1, g2){ if(g2.charAt(0) == '"') return g1+g2.replace(at1, by); if(g2.charAt(0) == "'") return g1+g2.replace(at2, by); return g1+g2.replace(at3, by); }; html = html.replace(selector, function(match){ return prefix + match.replace(re1, handleAttr); }); } /* <meta http-equiv=refresh content=" ; url= " > */ html = html.replace(new RegExp("<meta"+any+att+"http-equiv\\s*=\\s*(?:\""+ae("refresh")+"\""+any+etag+"|'"+ae("refresh")+"'"+any+etag+"|"+ae("refresh")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "gi"), "<!-- meta http-equiv=refresh stripped-->"); /* Stripping all scripts */ html = html.replace(new RegExp("<script"+any+">\\s*//\\s*<\\[CDATA\\[[\\S\\s]*?]]>\\s*</script[^>]*>", "gi"), "<!--CDATA script-->"); html = html.replace(/<script[\S\s]+?<\/script\s*>/gi, "<!--Non-CDATA script-->"); cr(tag+any+att+"on[-a-z0-9:_.]+="+any+etag, "on[-a-z0-9:_.]+"); /* Event listeners */ cr(tag+any+att+"href\\s*="+any+etag, "href"); /* Linked elements */ cr(tag+any+att+"src\\s*="+any+etag, "src"); /* Embedded elements */ cr("<object"+any+att+"data\\s*="+any+etag, "data"); /* <object data= > */ cr("<applet"+any+att+"codebase\\s*="+any+etag, "codebase"); /* <applet codebase= > */ /* <param name=movie value= >*/ cr("<param"+any+att+"name\\s*=\\s*(?:\""+ae("movie")+"\""+any+etag+"|'"+ae("movie")+"'"+any+etag+"|"+ae("movie")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "value"); /* <style> and < style= > url()*/ cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "url", "\\s*\\(\\s*", "", "\\s*\\)"); cri(tag+any+att+"style\\s*="+any+etag, "style", ae("url")+s+ae("(")+s, 0, s+ae(")"), ae(")")); /* IE7- CSS expression() */ cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "expression", "\\s*\\(\\s*", "", "\\s*\\)"); cri(tag+any+att+"style\\s*="+any+etag, "style", ae("expression")+s+ae("(")+s, 0, s+ae(")"), ae(")")); return html.replace(new RegExp("(?:"+prefix+")+", "g"), prefix); }
The sanitiseHTML
function is based on my replace_all_rel_by_abs
function (see this answer). The sanitiseHTML
function is completely rewritten though, in order to achieve maximum efficiency and reliability.
Additionally, a new set of RegExps are added to remove all scripts and event handlers (including CSS expression()
, IE7-). To make sure that all tags are parsed as expected, the adjusted tags are prefixed by <!--'"-->
. This prefix is necessary to correctly parse nested "event handlers" in conjunction with unterminated quotes: <a id="><input onclick="<div onmousemove=evil()>">
.
These RegExps are dynamically created using an internal function cr
/cri
(Create Replace [Inline]). These functions accept a list of arguments, and create and execute an advanced RE replacement. To make sure that HTML entities aren't breaking a RegExp (refresh
in <meta http-equiv=refresh>
could be written in various ways), the dynamically created RegExps are partially constructed by function ae
(Any Entity).
The actual replacements are done by function by
(replace by). In this implementation, by
adds data-
before all matched attributes.
<script>//<[CDATA[ .. //]]></script>
occurrences are striped. This step is necessary, because CDATA
sections allow </script>
strings inside the code. After this replacement has been executed, it's safe to go to the next replacement:<script>...</script>
tags are removed.<meta http-equiv=refresh .. >
tag is removedAll event listeners and external pointers/attributes (href
, src
, url()
) are prefixed by data-
, as described previously.
An IFrame
object is created. IFrames are less likely to leak memory (contrary to the htmlfile ActiveXObject). The IFrame becomes invisible, and is appended to the document, so that the DOM can be accessed. document.write()
are used to write HTML to the IFrame. document.open()
and document.close()
are used to empty the previous contents of the document, so that the generated document is an exact copy of the given html
string.
document
object. The second argument is a function, which destroys the generated DOM tree when called. This function should be called when you don't need the tree any more.doc
and destroy
), which behave the same as the previously mentioned arguments.designMode
property to "On" will stop a frame from executing scripts (not supported in Chrome). If you have to preserve the <script>
tags for a specific reason, you can use iframe.designMode = "On"
instead of the script stripping feature.htmlfile activeXObject
. According to this source, htmlfile
is slower than IFrames, and more susceptible to memory leaks.href
, src
, ...) are prefixed by data-
. An example of getting/changing these attributes is shown for data-href
:elem.getAttribute("data-href")
and elem.setAttribute("data-href", "...")
elem.dataset.href
and elem.dataset.href = "..."
.<link rel="stylesheet" href="main.css" />
<script>document.body.bgColor="red";</script>
<img src="128x128.png" />
No images: the size of the element may be completely different. sanitiseHTML(html)
Paste this bookmarklet in the location's bar. It will offer an option to inject a textarea, showing the sanitised HTML string.
javascript:void(function(){var s=document.createElement("script");s.src="http://rob.lekensteyn.nl/html-sanitizer.js";document.body.appendChild(s)})();
Code examples - string2dom(html)
:
string2dom("<html><head><title>Test</title></head></html>", function(doc, destroy){ alert(doc.title); /* Alert: "Test" */ destroy(); }); var test = string2dom("<div id='secret'></div>"); alert(test.doc.getElementById("secret").tagName); /* Alert: "DIV" */ test.destroy();
sanitiseHTML(html)
is based on my previously created replace_all_rel_by_abs(html)
function.<applet>
)Not sure why you're messing with documentFragments, you can just set the HTML text as the innerHTML
of a new div element. Then you can use that div element for getElementsByTagName
etc without adding the div to DOM:
var htmlText= '<html><head><title>Test</title></head><body><div id="test_ele1">this is test_ele1 content</div><div id="test_ele2">this is test_ele content2</div></body></html>'; var d = document.createElement('div'); d.innerHTML = htmlText; console.log(d.getElementsByTagName('div'));
If you're really married to the idea of a documentFragment, you can use this code, but you'll still have to wrap it in a div to get the DOM functions you're after:
function makeDocumentFragment(htmlText) { var range = document.createRange(); var frag = range.createContextualFragment(htmlText); var d = document.createElement('div'); d.appendChild(frag); return d; }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With