While looking at the solution to Level 15 of http://escape.alf.nu, I noticed that <!--<script> causes the DOM parser to break. In the following HTML you won't see the string "Test" (tested on IE 11, Firefox and Chrome):
<!DOCTYPE HTML>
<html>
<body>
<script>
var a = '<!--<script>';
</script>
<p>Test</p>
</body>
</html>
But these two scripts will show "Test":
<!DOCTYPE HTML>
<html>
<body>
<script>
var a = '<!--';
</script>
<p>Test</p>
</body>
</html>
And,
<!DOCTYPE HTML>
<html>
<body>
<script>
var a = '<script>';
</script>
<p>Test</p>
</body>
</html>
I don't understand why this happens.
The browser reads the HTML file from top to bottom, building the DOM tree as it goes. When it sees a <script>, it pauses parsing to download and execute that script, then carries on until it has parsed the whole page.
JavaScript Compilation
The scripts are parsed into abstract syntax trees. Some browser engines take the abstract syntax tree and pass it into a compiler, outputting bytecode which is executed on the main thread. This is known as JavaScript compilation.
The script sections of a web page are handled by the browser's JavaScript interpreter, which may be an intrinsic part of the browser but usually is a distinct module, sometimes even a completely distinct project (Chrome uses V8; IE uses JScript; Firefox uses SpiderMonkey; etc.).
This raises the important point that the text inside <script> tags on an HTML page is parsed by the HTML parser before it is parsed by the JavaScript parser.
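To make that point concrete, here is a rough Python sketch of how an HTML tokenizer decides where a script element ends. It is loosely modelled on the "script data escaped" and "script data double escaped" tokenizer states in the HTML parsing algorithm. The function name script_end and the three state names are made up for this illustration, and it is a simplification: the real tokenizer works character by character and handles edge cases (such as <!-->) that this ignores.

```python
import re

# Only four markers matter to this simplified model. "<script" and "</script"
# count only when followed by whitespace, "/" or ">".
TOKEN = re.compile(r"<!--|-->|</?script(?=[\s/>])", re.IGNORECASE)


def script_end(text):
    """Return the index of the "</script" that ends the element, or None.

    `text` is everything that follows the opening <script> tag.
    """
    state = "data"
    for match in TOKEN.finditer(text):
        token = match.group(0).lower()
        if state == "data":
            if token == "<!--":
                state = "escaped"
            elif token == "</script":
                return match.start()
        elif state == "escaped":
            if token == "-->":
                state = "data"
            elif token == "<script":
                state = "double_escaped"
            elif token == "</script":
                return match.start()
        else:  # double_escaped: "</script" only steps back to "escaped"
            if token == "-->":
                state = "data"
            elif token == "</script":
                state = "escaped"
    return None  # the script element is never closed
```

Fed the text that follows the opening <script> tag in the three pages from the question, this returns None for the <!--<script> page (the </script> only leaves the double-escaped state, so <p>Test</p> is swallowed as script text) and the position of </script for the other two.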
This code is not valid HTML5 syntax, so the authoring rules in the HTML5 specification do not tell us directly what is going on here. To be specific, there are two issues:
1. A <script> tag without a closing </script>.
2. A <!-- without a closing -->. (see restrictions for contents of script elements)
Both of these problems put a browser's HTML parser into error handling, which means it is trying to make sense of invalid syntax. From an author's point of view, what browsers do with invalid syntax is not something to rely on, and it can look as if anything can happen (such as nasal demons). The de facto behavior here is that browsers agree on how they handle this input, but it is error handling nonetheless.
For whatever reason, this combination of syntax issues next to each other causes browsers to ignore the text later in the document.
EDIT: I have identified how the parsing error is produced by stepping through this part of the HTML5 spec.
The text content of the script (excluding whitespace) is
var a = '<!--<script>';
This must match the following grammar rule:
data1 *( escape [ script-start data3 ] "-->" data1 ) [ escape ]
We can begin parsing the text content by matching data1, which has the following rule:
data1 = < any string that doesn't contain a substring that matches not-data1 >
not-data1 = "<!--"
That is, the string var a = ' matches the data1 production. It ends there because the next part is <!--.
For there to be any text afterwards in the script, it must match the escape production, which is as follows:
escape = "<!--" data2 *( script-start data3 script-end data2 )
Let's match the next part of the text. So far we have
data1: var a = '
escape: <!--
data2: ???
Now nothing can be contained in data2 because the data2 production prohibits the substring <script> (i.e. a script-start)!
data2 = < any string that doesn't contain a substring that matches not-data2 >
not-data2 = script-start / "-->"
The lexer cannot proceed with valid steps according to the grammar, so the browser must now go into error processing.
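The walk-through above can be checked mechanically. Below is a minimal Python sketch that transcribes the quoted grammar into a regular expression. The data3, script-start, script-end and tag-end rules are not quoted above; they are filled in here as I understand them from the same section of the spec, so treat them as assumptions. The names without and matches_grammar are hypothetical helpers for this example.

```python
import re

# Assumed from the spec: tag-end is whitespace, "/" or ">".
TAG_END = r"[\t\n\f\r />]"
SCRIPT_START = r"<script" + TAG_END
SCRIPT_END = r"</script" + TAG_END


def without(*forbidden):
    """Regex for a run of characters in which none of `forbidden` starts."""
    return r"(?:(?!" + "|".join(forbidden) + r").)*"


DATA1 = without(r"<!--")
DATA2 = without(SCRIPT_START, r"-->")
DATA3 = without(SCRIPT_END, r"-->")  # assumed by analogy with data2

# escape = "<!--" data2 *( script-start data3 script-end data2 )
ESCAPE = rf"<!--{DATA2}(?:{SCRIPT_START}{DATA3}{SCRIPT_END}{DATA2})*"
# data1 *( escape [ script-start data3 ] "-->" data1 ) [ escape ]
SCRIPT = rf"{DATA1}(?:{ESCAPE}(?:{SCRIPT_START}{DATA3})?-->{DATA1})*(?:{ESCAPE})?"

SCRIPT_RE = re.compile(SCRIPT, re.IGNORECASE | re.DOTALL)


def matches_grammar(text):
    """True if the whole script text matches the grammar above."""
    return SCRIPT_RE.fullmatch(text) is not None
```

Under these assumptions, matches_grammar("var a = '<!--<script>';") is False, while the two variants from the question that do show "Test" (var a = '<!--'; and var a = '<script>';) both match.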
It'll be some assumption being violated in the internal mechanism.
There's not much point trying to rationalise about this stuff.
You wrote invalid HTML, so anything can happen.