
How do browsers parse a script tag exactly?

I've just run into a pathological case with HTML parsing. I've always thought that a <script> tag would run until the first closing </script> tag. But it turns out this is not always the case.

This is valid:

<script><!--
alert('<script></script>');
--></script>

And even this is valid:

<script><!--
alert('<script></script>');
</script>

But this is not:

<script><!--
alert('</script>');
--></script>

And neither is this:

<script>
alert('<script></script>');
</script>

This behavior is consistent in Firefox and Chrome. So, as hard as it is to believe, browsers seem to accept an open+close script tag pair inside an HTML comment inside a script tag. So the question is: how do browsers really parse script tags? This matters because the HTML parsing library I'm using, Nokogiri, assumed the obvious (but incorrect) until-the-first-closing-tag rule and did not handle this edge case. I imagine most other libraries would not handle it either.
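To make the problem concrete, here is a minimal Python sketch of the until-the-first-closing-tag assumption applied to the first (valid) example. The function name `naive_script_body` is made up for illustration; this is not Nokogiri's actual code, only the rule as I described it.

```python
def naive_script_body(html):
    """Return the script text, assuming the element ends at the first </script>.

    Hypothetical helper for illustration only; it also assumes the opening
    tag is exactly "<script>" with no attributes.
    """
    start = html.index("<script>") + len("<script>")
    end = html.index("</script>", start)
    return html[start:end]


# The first example from the question, which browsers accept as one script.
valid_example = "<script><!--\nalert('<script></script>');\n--></script>"

# The naive rule stops inside the string literal, so the body is cut short.
body = naive_script_body(valid_example)
print(repr(body))
```

The returned body stops right after the inner `<script>`, so the string literal is left unterminated and everything after it would be treated as markup.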

838

asked Jan 29 '13 01:01

Daniel

2 Answers

After poring over the links given by Tim and Jukka I came to the following answer:

after the opening <script> tag, the parser goes to data1 state
if <!-- is encountered while in data1 state, switch to data2 state
if --> is encountered while in any state, switch to data1 state
if <script[\s/>] is encountered while in data2 state, switch to data3 state
if </script[\s/>] is encountered while in data3 state, switch to data2 state
if </script[\s/>] is encountered while in any other state, stop parsing
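The rules above can be sketched as a small state machine in Python. This is only a transcription of the five rules as listed, not a browser's real tokenizer; `find_script_end`, `TOKEN` and the `data1`/`data2`/`data3` labels are names taken from or invented for this answer, and matching is assumed to be case-insensitive.

```python
import re

# The four sequences the rules react to. The end tag is listed before the
# start tag only for readability; they cannot match at the same position.
TOKEN = re.compile(r"<!--|-->|</script[\s/>]|<script[\s/>]", re.IGNORECASE)


def find_script_end(html, start):
    """Return the index of the "</script" that ends the element, or -1.

    `start` is the index just past the opening <script ...> tag.
    """
    state = "data1"
    for match in TOKEN.finditer(html, start):
        token = match.group(0).lower()
        if token == "<!--":
            if state == "data1":
                state = "data2"
        elif token == "-->":
            state = "data1"
        elif token.startswith("</script"):
            if state == "data3":
                state = "data2"  # closes the nested <script>, keep going
            else:
                return match.start()  # real end of the script element
        else:  # "<script" followed by whitespace, "/" or ">"
            if state == "data2":
                state = "data3"
    return -1


example = "<script><!--\nalert('<script></script>');\n--></script>"
end = find_script_end(example, len("<script>"))
print(example[len("<script>"):end])
```

Run against the four examples from the question, this sketch ends the element at the last `</script>` for the two valid ones and at the first `</script>` (inside the string literal) for the two invalid ones, which matches what I saw in Firefox and Chrome.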

192

answered Oct 07 '22 17:10

Daniel

All the examples are invalid as per the HTML 4.01 specification: the content of script is declared as CDATA, and the description of CDATA says:

“Although the STYLE and SCRIPT elements use CDATA for their data model, for these elements, CDATA must be handled differently by user agents. Markup and entities must be treated as raw text and passed to the application as is. The first occurrence of the character sequence "</" (end-tag open delimiter) is treated as terminating the end of the element's content. In valid documents, this would be the end tag for the element.”

As you have observed, browsers might not enforce this rule but instead recognize pairs of start and end tags, in some situations. From the spec perspective, this is handling of invalid documents, i.e. error processing. It is not clear what exactly they are doing here and why. It seems to depend on the presence of <!--, which should not have any effect on HTML 4.01 parsing (it is not a comment opener in CDATA content).
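For comparison, the strict HTML 4.01 rule quoted above is easy to sketch in Python. The function name `html401_script_content` is hypothetical, and this only illustrates the quoted sentence about the first "</"; it is not how any particular browser or validator is implemented.

```python
def html401_script_content(html, start):
    """Return the element content under the quoted HTML 4.01 CDATA rule.

    The content ends at the first "</" after `start`, whatever follows it.
    `start` is the index just past the opening <script ...> tag.
    """
    end = html.find("</", start)
    return html[start:] if end == -1 else html[start:end]


page = "<script>document.write('<b>x</b>');</script>"
content = html401_script_content(page, len("<script>"))
print(repr(content))
```

Under this rule even a `</b>` inside a string literal terminates the content, so the comment-dependent behavior observed in browsers cannot come from HTML 4.01 parsing.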

In XHTML, partly different rules apply, because in XHTML, <!-- opens a comment within the content of a script element.

As an aside, all the examples are invalid HTML 4.01 and invalid XHTML due to the lack of the type attribute in script. The attribute is not needed (browsers default to treating the content as JavaScript), but it’s required by those specs.

In HTML5, other rules apply. They are rather complicated, and they are supposed to describe browser behavior. In addition to imposing restrictions on the content of script elements, HTML5 also specifies parsing rules.

answered Oct 07 '22 18:10

Jukka K. Korpela
