
What is the maximum depth of HTML documents in practice?

I want to allow embedding of HTML but avoid DoS due to deeply nested HTML documents that crash some browsers. I'd like to be able to accommodate 99.9% of documents, but reject those that nest too deeply.

Two closely related questions:

  1. What document depth limits are built into browsers? E.g. browser X fails to parse or does not build documents with depth > some limit.
  2. Are document depth statistics available for documents on the web? Is there a site with web statistics showing that some percentage of real documents on the web have a depth less than some value?

Document depth is defined as 1 + the maximum number of parent traversals needed to reach the document root from any node in a document. For example, in

    <html>                   <!-- 1 -->
      <body>                 <!-- 2 -->
        <div>                <!-- 3 -->
          <table>            <!-- 4 -->
            <tbody>          <!-- 5 -->
              <tr>           <!-- 6 -->
                <td>         <!-- 7 -->
                  Foo        <!-- 8 -->

the maximum depth is 8 since the text node "Foo" has 8 ancestors. Ancestor here is interpreted non-strictly, i.e. every node is its own ancestor and its own descendant.
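
A minimal sketch of the depth definition above, written in JavaScript. The function name `documentDepth` is made up for illustration. It only assumes that each node exposes a `childNodes` collection, as DOM nodes do, so in a browser passing `document.documentElement` should reproduce the numbering used in the example. It walks the tree with an explicit stack instead of recursion, so that a very deep tree does not overflow the call stack of the checker itself.

```javascript
// Hypothetical helper: depth of a tree whose nodes expose `childNodes`
// (a DOM NodeList or a plain array). The root counts as depth 1.
function documentDepth(root) {
  let maxDepth = 0;
  // Explicit stack, so a deeply nested tree cannot blow the JS call stack.
  const stack = [{ node: root, depth: 1 }];
  while (stack.length > 0) {
    const { node, depth } = stack.pop();
    if (depth > maxDepth) {
      maxDepth = depth;
    }
    const children = node.childNodes || [];
    for (let i = 0; i < children.length; i++) {
      stack.push({ node: children[i], depth: depth + 1 });
    }
  }
  return maxDepth;
}

// Plain-object stand-in for the example document, so this also runs outside a browser.
const names = ['html', 'body', 'div', 'table', 'tbody', 'tr', 'td'];
let exampleTree = { name: '#text', childNodes: [] }; // the "Foo" text node
for (let i = names.length - 1; i >= 0; i--) {
  exampleTree = { name: names[i], childNodes: [exampleTree] };
}
console.log(documentDepth(exampleTree)); // 8
```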

Opera has some table nesting stats, which suggest that 99.99% of documents have a table nesting depth of less than 22, but that data does not contain whole document depth.

EDIT:

If people would like to criticize the HTML sanitization library instead of answering this question, please do. http://code.google.com/p/owasp-java-html-sanitizer/wiki/AttackReviewGroundRules explains how to find the code, where to find a testbed that lets you try out attacks, and how to report issues.

EDIT:

I asked Adam Barth, and he very kindly pointed me to the WebKit code that handles this.

WebKit, at least, enforces a depth limit. When a tree builder is created, it receives a configurable maximum tree depth:

m_treeBuilder(HTMLTreeBuilder::create(this, document, reportErrors, usePreHTML5ParserQuirks(document), maximumDOMTreeDepth(document)))

and it is tested by the block-nesting-cap test.

Mike Samuel asked Oct 14 '11 16:10


1 Answer

It may be worth asking [email protected]. Their study from 2005 (http://code.google.com/webstats/) doesn't cover your particular question. They sampled more than a billion documents though, and are interested in hearing about anything you feel is worth examining.

--[Update]--

Here's a crude script I wrote to test the browsers I have (putting the number of elements to nest into the query string):

    var n = Number(window.location.search.substring(1));

    var outboundHtml = '';
    var inboundHtml = '';

    for (var i = 0; i < n; i++) {
        outboundHtml += '<div>' + (i + 1);
        inboundHtml += '</div>';
    }

    var testWindow = window.open();
    testWindow.document.open();
    testWindow.document.write(outboundHtml + inboundHtml);
    testWindow.document.close();

And here are my findings (they may be specific to my machine: Windows XP, 3 GB RAM):

  • Chrome 9: 3218 nested elements will render, 3129 crashes tab. (Chrome 9 is old I know, the updater fails on my corporate LAN)
  • Safari 5: 3477 will render, 3478 browser closes completely.
  • IE8: 1000000+ will render (memory permitting), although performance degrades significantly in the high 4-figure range due to event bubbling when scrolling, moving the mouse, etc. Anything over 10000 appears to lock up, but I think it is just taking a very long time, so it is effectively a DoS.
  • Opera 11: Just limited by memory as far as I can tell, i.e. my script runs out of memory for 10000000. For large documents that do render though, there doesn't seem to be any performance degradation like in IE.
  • Firefox 3.6: ~1500000 will render, but testing above this range resulted in the browser crashing with Mozilla Crash Reporter or just hanging. Sometimes a number that had worked would fail on a subsequent run, and larger numbers (~1700000) would crash Firefox straight from a restart.

More on Chrome:

Changing the DIV to a SPAN resulted in Chrome being able to nest 9202 elements before crashing. So the size of the HTML is not the limiting factor (although SPAN elements may be more lightweight).

Nesting 2077 table cells (<table><tr><td>), i.e. 6231 elements, loaded without crashing, but scrolling down to cell 445 crashed the tab, so in practice you can't nest 445 table cells (1335 elements).

Testing with files generated from the script (as opposed to writing to new windows) gives slightly higher tolerances, but Chrome still crashed.

You can nest 1409 list items (<ul><li>) before it crashes, which is interesting because:

  • Firefox stops indenting list items after 99, a programmatic constraint maybe.
  • Opera keeps indenting, with glitches at 250, 376, 502, 628, 754, 880...

Setting a DOCTYPE has an effect in IE8 (it puts the browser into standards mode, i.e. var outboundHtml = '<!DOCTYPE html>';): it will then not nest 792 list items (the tab crashes/closes) or 1593 DIVs. In IE8 it made no difference whether the test was generated from the script or loaded from a file.
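
The variations tried above (DIV, SPAN, table cells, list items, with or without a DOCTYPE) could all be produced by one small generator. This is only a sketch: `buildNestedHtml` and its parameters are names invented here, not part of the original script. It returns the markup as a string, which can be passed to `document.write` as in the script above or saved to a file.

```javascript
// Hypothetical generator for nesting test payloads.
// tags:    the chain of elements opened at each level, e.g. ['table', 'tr', 'td']
// levels:  how many times the chain is nested inside itself
// doctype: optional prefix such as '<!DOCTYPE html>' to force standards mode
function buildNestedHtml(tags, levels, doctype = '') {
  const openUnit = tags.map((tag) => '<' + tag + '>').join('');
  // Closing tags go in reverse order so each level is well-formed.
  const closeUnit = tags
    .slice()
    .reverse()
    .map((tag) => '</' + tag + '>')
    .join('');

  let opening = doctype;
  for (let i = 0; i < levels; i++) {
    // Label each level with its number, as the original script does.
    opening += openUnit + (i + 1);
  }
  return opening + closeUnit.repeat(levels);
}

console.log(buildNestedHtml(['ul', 'li'], 2));
// <ul><li>1<ul><li>2</li></ul></li></ul>
```

For example, `buildNestedHtml(['div'], 1593, '<!DOCTYPE html>')` would build a payload with the DIV count mentioned for IE8 in standards mode.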

So a browser's nesting limit apparently depends on the type of HTML elements the attacker injects and on the layout engine. Other markup may well trigger a crash with a considerably smaller payload than the ones tested here. Either way, we have a plain-HTML DoS for IE8, Chrome and Safari users with a fairly small payload.

It seems if you are going to allow users to post HTML that gets rendered on one of your pages, it is worth considering a limit on nested elements if there is a generous size limit.
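
As a rough illustration of such a limit, here is a sketch of a pre-filter that estimates element nesting from the raw markup. It is an assumption-laden approximation, not a real HTML parser: the names `maxTagNesting` and `exceedsNestingLimit` are invented here, tags are found with a regular expression, implied end tags are ignored, tags inside comments or attribute values are miscounted, and text nodes are not counted. A real sanitizer would enforce the cap while it builds or walks the tree. The limit of 100 in the usage line is an arbitrary placeholder, not a recommendation.

```javascript
// Elements that never have children, so they do not add nesting.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Approximate the deepest element nesting by scanning tags in order.
function maxTagNesting(html) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*>/g;
  let depth = 0;
  let max = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === '/';
    const name = match[2].toLowerCase();
    if (VOID_ELEMENTS.has(name)) {
      continue;
    }
    if (isClosing) {
      // Stray closing tags must not drive the depth negative.
      if (depth > 0) {
        depth--;
      }
    } else {
      depth++;
      if (depth > max) {
        max = depth;
      }
    }
  }
  return max;
}

// True if the markup nests elements more deeply than `limit`.
function exceedsNestingLimit(html, limit) {
  return maxTagNesting(html) > limit;
}

console.log(exceedsNestingLimit('<div>'.repeat(3000), 100)); // true
```

Because unclosed tags are counted as still open, the estimate errs on the side of rejecting, which seems the safer direction for this purpose.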

Lee Kowalkowski answered Oct 19 '22 08:10