
Unclosed / misnested HTML tags extend past their parent

I'm running into some interesting behavior when HTML tags aren't closed. Sometimes the browser inserts extra opening and closing tags to compensate, and other times it just inserts a closing tag. This is best explained through examples:

With the <sup> tag:

first text node
<div> This is a parent div <sup>superscript tag starts IN parent</div> text OUTSIDE node of parent

With the <s> tag:

first text node
<div> This is a parent div <s>strikethrough tag starts IN parent</div> text OUTSIDE node of parent

As you can see, in the first example the browser automatically closes the <sup> tag before its parent closes. In the second example, however, the browser closes the <s> tag before the end of its parent and then inserts another opening <s> after the parent.

I've looked through the specs for <s> and <sup>, but I can't find anything specific about how browsers interpret and deal with unclosed tags, or at least nothing that explains this behavior.

I want to know this because of a live Markdown parser I'm using: users may not have finished their tags when it parses their source.

I'd like to know how the browser deals with these things, so I can code for that use-case. At the present time the browser handles closing different tags in different ways (as you can see by my examples).

Does anyone know why the browser does this? Or at least know of a list of elements that behave the same way?

Aᴄʜᴇʀᴏɴғᴀɪʟ asked Sep 30 '16 00:09



1 Answer

Thanks to @Ankith Amtange I found the explanation of what happens. I'll write it out here for future readers.

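The <s> tag extends past its parent because it is a formatting element: when </div> closes it early, it stays in the parser's list of active formatting elements and is reopened for the text that follows. The <sup> tag is an ordinary element, so the browser simply closes it when its parent ends and does not reopen it.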

The HTML parser sorts the elements on its stack of open elements into the following categories (source):

Special elements

  • The following elements have varying levels of special parsing rules: HTML's address, applet, area, article, aside, base, basefont, bgsound, blockquote, body, br, button, caption, center, col, colgroup, dd, details, dir, div, dl, dt, embed, fieldset, figcaption, figure, footer, form, frame, frameset, h1, h2, h3, h4, h5, h6, head, header, hgroup, hr, html, iframe, img, input, isindex, li, link, listing, main, marquee, meta, nav, noembed, noframes, noscript, object, ol, p, param, plaintext, pre, script, section, select, source, style, summary, table, tbody, td, template, textarea, tfoot, th, thead, title, tr, track, ul, wbr, and xmp; MathML's mi, mo, mn, ms, mtext, and annotation-xml; and SVG's foreignObject, desc, and title.

Formatting elements

  • The following HTML elements are those that end up in the list of active formatting elements: a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u.

Ordinary elements

  • All other elements found while parsing an HTML document.
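To check which category a given tag falls into, the lists above can be turned into a small lookup. This is only a sketch: the function name `category` is made up for illustration, the sets are copied from the HTML element lists quoted in this answer (the MathML and SVG entries are left out), and the spec's lists may have changed since.

```python
# HTML "special" elements, copied from the list quoted above.
SPECIAL = set("""
address applet area article aside base basefont bgsound blockquote body br
button caption center col colgroup dd details dir div dl dt embed fieldset
figcaption figure footer form frame frameset h1 h2 h3 h4 h5 h6 head header
hgroup hr html iframe img input isindex li link listing main marquee meta nav
noembed noframes noscript object ol p param plaintext pre script section
select source style summary table tbody td template textarea tfoot th thead
title tr track ul wbr xmp
""".split())

# Formatting elements, copied from the list quoted above.
FORMATTING = set("a b big code em font i nobr s small strike strong tt u".split())


def category(tag):
    """Return 'special', 'formatting' or 'ordinary' for an HTML tag name."""
    tag = tag.lower()
    if tag in SPECIAL:
        return "special"
    if tag in FORMATTING:
        return "formatting"
    return "ordinary"


for tag in ("div", "s", "sup"):
    print(tag, "->", category(tag))
```

For the two tags in the question, `s` comes out as formatting and `sup` as ordinary, which is the difference that matters here.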

Explanation (from linked spec):

The most-often discussed example of erroneous markup is as follows:

<p>1<b>2<i>3</b>4</i>5</p>

The parsing of this markup is straightforward up to the "3". At this point, the DOM looks like this:

─html
 ├──head
 └──body
    └──p
       ├──"1"
       └──b
          ├──"2"
          └──i
             └──"3"

Here, the stack of open elements has five elements on it: html, body, p, b, and i. The list of active formatting elements just has two: b and i. The insertion mode is "in body".

Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked. This is a simple case, in that the formatting element is the b element, and there is no furthest block. Thus, the stack of open elements ends up with just three elements: html, body, and p, while the list of active formatting elements has just one: i. The DOM tree is unmodified at this point.

The next token is a character ("4"), triggers the reconstruction of the active formatting elements, in this case just the i element. A new i element is thus created for the "4" Text node. After the end tag token for the "i" is also received, and the "5" Text node is inserted, the DOM looks as follows:

─html
 ├──head
 └──body
    └──p
       ├──"1"
       ├──b
       │  ├──"2"
       │  └──i
       │     └──"3"
       ├──i
       │  └──"4"
       └──"5"
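The walkthrough above can be approximated with a toy model of the two data structures involved. This is a heavily simplified sketch, not the spec's algorithm: `Node`, `parse` and `to_html` are hypothetical names, only `div` and `p` are treated as special elements, and the adoption agency case with a furthest block is deliberately not modelled. It only shows how an element that is popped off the stack of open elements but left in the list of active formatting elements gets reopened for the next piece of content.

```python
import re

FORMATTING = {"a", "b", "big", "code", "em", "font", "i", "nobr",
              "s", "small", "strike", "strong", "tt", "u"}
SPECIAL = {"div", "p"}  # assumption: tiny stand-in for the "special" list


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []  # child Node objects or plain text strings


def to_html(node):
    """Serialize the children of `node` back to markup."""
    return "".join(
        c if isinstance(c, str) else f"<{c.name}>{to_html(c)}</{c.name}>"
        for c in node.children
    )


def parse(markup):
    body = Node("body")
    stack = [body]  # stack of open elements
    active = []     # list of active formatting elements

    def insert(name):
        node = Node(name)
        stack[-1].children.append(node)
        stack.append(node)
        return node

    def reconstruct():
        # Reopen trailing formatting entries that are no longer open.
        i = len(active)
        while i > 0 and active[i - 1] not in stack:
            i -= 1
        for j in range(i, len(active)):
            active[j] = insert(active[j].name)

    for slash, name, text in re.findall(r"<(/?)(\w+)>|([^<]+)", markup):
        if text:
            reconstruct()
            stack[-1].children.append(text)
        elif not slash:
            if name not in SPECIAL:
                reconstruct()
            node = insert(name)
            if name in FORMATTING:
                active.append(node)
        else:
            entry = next((n for n in reversed(active) if n.name == name), None)
            if entry is not None:
                # End tag for a formatting element (simple case only).
                active.remove(entry)
                if entry in stack:
                    idx = stack.index(entry)
                    if any(n.name in SPECIAL for n in stack[idx + 1:]):
                        raise NotImplementedError("furthest block not modelled")
                    del stack[idx:]
            else:
                # Any other end tag: pop up to the matching open element.
                for i in range(len(stack) - 1, 0, -1):
                    if stack[i].name == name:
                        del stack[i:]
                        break
                    if name not in SPECIAL and stack[i].name in SPECIAL:
                        break  # ordinary end tag cannot cross a special one
    return to_html(body)


print(parse("<p>1<b>2<i>3</b>4</i>5</p>"))
print(parse("<div>a<s>b</div>c"))
print(parse("<div>a<sup>b</div>c"))
```

Under these assumptions the first call reproduces the tree shown above, and the last two mirror the examples from the question: the `s` is reopened after the `div`, while the `sup` is not.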
Aᴄʜᴇʀᴏɴғᴀɪʟ answered Oct 23 '22 12:10
