I've just run into a pathological case with HTML parsing. I've always thought that a <script> tag would run until the first closing </script> tag. But it turns out this is not always the case.
This is valid:
<script><!--
alert('<script></script>');
--></script>
And even this is valid:
<script><!--
alert('<script></script>');
</script>
But this is not:
<script><!--
alert('</script>');
--></script>
And neither is this:
<script>
alert('<script></script>');
</script>
This behavior is consistent in Firefox and Chrome. So, as hard as it is to believe, browsers seem to accept an open+close script tag inside an HTML comment inside a script tag. So the question is: how do browsers really parse script tags? This matters because the HTML parsing library I'm using, Nokogiri, assumed the obvious (but incorrect) until-the-first-closing-tag rule and did not handle this edge case. I imagine most other libraries would not handle it either.
The <script> tag is used to embed a client-side script (JavaScript). The <script> element either contains scripting statements, or it points to an external script file through the src attribute. Common uses for JavaScript are image manipulation, form validation, and dynamic changes of content.
JavaScript compilation: JavaScript is parsed, compiled, interpreted and executed. Scripts are parsed into abstract syntax trees. Some browser engines pass the abstract syntax tree to an interpreter that outputs bytecode, which is executed on the main thread. This is known as JavaScript compilation.
HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster. The parser parses tokenized input into the document, building up the document tree.
Parsing means analyzing and converting a program into an internal format that a runtime environment can actually run. In other words, parsing means taking the code we write as text (HTML, CSS) and transforming it into something that the browser can work with.
After poring over the links given by Tim and Jukka I came to the following answer:
- After the opening <script> tag, the parser goes to data1 state
- If <!-- is encountered while in data1 state, switch to data2 state
- If --> is encountered while in any state, switch to data1 state
- If <script[\s/>] is encountered while in data2 state, switch to data3 state
- If </script[\s/>] is encountered while in data3 state, switch to data2 state
- If </script[\s/>] is encountered while in any other state, stop parsing

All the examples are invalid as per the HTML 4.01 specification: the content of script is declared as CDATA, and the description of CDATA says:
“Although the STYLE and SCRIPT elements use CDATA for their data model, for these elements, CDATA must be handled differently by user agents. Markup and entities must be treated as raw text and passed to the application as is. The first occurrence of the character sequence "</" (end-tag open delimiter) is treated as terminating the end of the element's content. In valid documents, this would be the end tag for the element.”
As you have observed, browsers might not enforce this rule but instead recognize pairs of start and end tags, in some situations. From the spec perspective, this is handling of invalid documents, i.e. error processing. It is not clear what exactly they are doing here and why. It seems to depend on the presence of <!--, which should not have any effect on HTML 4.01 parsing (it is not a comment opener in CDATA content).
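To make the quoted HTML 4.01 rule concrete, here is a minimal Python sketch of a literal reading of it: the script content ends at the first "</" after the opening tag, and <!-- plays no role. The function name html4_script_content is hypothetical, and the sketch assumes a bare <script> start tag with no attributes.

```python
def html4_script_content(text: str) -> str:
    """Return the script content under a literal reading of the quoted
    HTML 4.01 rule: everything up to the first "</" after <script>."""
    start_tag = "<script>"  # assumption: start tag has no attributes
    start = text.index(start_tag) + len(start_tag)
    end = text.find("</", start)
    return text[start:] if end == -1 else text[start:end]


# Two of the examples from the question, with and without the <!-- opener.
with_comment = "<script><!--\nalert('<script></script>');\n--></script>"
without_comment = "<script>\nalert('<script></script>');\n</script>"

# Both are cut at the "</" inside the string literal; <!-- changes nothing.
print(repr(html4_script_content(with_comment)))
print(repr(html4_script_content(without_comment)))
```

Under this reading both snippets are truncated in the middle of the alert call, which is why the browsers' acceptance of the first one counts as error handling rather than conforming HTML 4.01 parsing.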
In XHTML, partly different rules apply, because in XHTML, <!-- opens a comment within the content of a script element.
As an aside, all the examples are invalid HTML 4.01 and invalid XHTML due to the lack of the type attribute in script. The attribute is not needed (browsers default to treating the content as JavaScript), but it’s required by those specs.
In HTML5, other rules apply. They are rather complicated, and they are supposed to describe browser behavior. In addition to imposing restrictions on content (forbidding e.g. <!-- without matching -->), HTML5 also specifies parsing rules.
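The data1/data2/data3 rules from the first answer are a simplification of those parsing rules; they seem to correspond roughly to the HTML5 tokenizer's "script data", "script data escaped" and "script data double escaped" states. A minimal Python sketch of the six rules, not of the full HTML5 tokenizer, might look like this. The function name script_content_end, the case-insensitive matching, and the choice to resume scanning one character after each match are assumptions of the sketch.

```python
import re

# The four character sequences named in the rules. Case-insensitive
# matching is an assumption of this sketch.
TOKEN = re.compile(r"<!--|-->|<script[\s/>]|</script[\s/>]", re.IGNORECASE)


def script_content_end(text: str, start: int = 0) -> int:
    """Scan text that follows an opening <script> tag and return the index
    of the </script that ends the element, or -1 if there is none."""
    state = "data1"
    pos = start
    while True:
        m = TOKEN.search(text, pos)
        if m is None:
            return -1
        token = m.group().lower()
        if token == "-->":
            state = "data1"                 # from any state
        elif token == "<!--":
            if state == "data1":
                state = "data2"
        elif token.startswith("</script"):
            if state == "data3":
                state = "data2"             # closes the nested <script
            else:
                return m.start()            # real end of the element
        else:                               # "<script" + whitespace, "/" or ">"
            if state == "data2":
                state = "data3"
        # Resume one character later so overlapping sequences are still seen.
        pos = m.start() + 1


# The four bodies from the question (text after the opening <script> tag).
examples = {
    "comment, nested pair, -->": "<!--\nalert('<script></script>');\n--></script>",
    "comment, nested pair": "<!--\nalert('<script></script>');\n</script>",
    "comment, lone </script>": "<!--\nalert('</script>');\n--></script>",
    "no comment, nested pair": "\nalert('<script></script>');\n</script>",
}

for label, body in examples.items():
    end = script_content_end(body)
    print(label, "->", repr(body[:end]))
```

For the first two bodies the sketch keeps the whole alert call as script content; for the last two it cuts the content off inside the string literal, which matches the behavior reported in the question.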