
HTML Tidy fails on script tag in JavaScript string literal

I'm using HTML Tidy in PHP and it's producing unexpected results because of a <script> tag in a JavaScript string literal. Here's a sample input:

<html>
<script>
var t='<script><'+'/script>';
</script>
</html>

HTML Tidy's output:

<html>
<script>
//<![CDATA[
var t='<script><'+'/script>';
<\/script>
<\/html>
//]]>
</script>
</html>

It's interpreting </script></html> as part of the script. Then, it adds another </script></html> to close the open tags. I tried this on an online version of HTML Tidy (http://www.dirtymarkup.com/) and it's producing the same error.

How do I prevent this error from occurring in PHP?

Leo Jiang asked Feb 26 '14 00:02


2 Answers

After experimenting a bit, I found that adding the comment //'<\/script>' after the string confuses the algorithm enough to prevent the bug:

<html>
<script>
var t='<script><'+'/script>'; //'<\/script>'
</script>
</html>

After clean-up:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">

<html>
<head>

   <script>
var t='<script><'+'/script>'; //'<\/script>'
   </script>

   <title></title>
</head>

<body>
</body>
</html>

My guess is that when the clean-up algorithm scans the code and sees the string <script> twice, it expects a matching </script> for each one. Splitting the closing tag into < and /script> means the second </script> goes undetected, which is why Tidy adds another </script> at the end of the code and, for some reason, also closes it with another </html>. (Poor design indeed!)

So I made a second assumption: that the algorithm has no check for whether a </script> is inside a comment. And I was right! Putting the string <\/script> in a JavaScript comment makes the algorithm think there are two </script> tags in total.
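
The guess above can be illustrated with a toy model. This is only a sketch of the behaviour described in this answer, not HTML Tidy's actual implementation: the function name scriptClosesInToyModel and the counting rule (every <script opens one level, every </script or <\/script closes one, wherever it appears) are assumptions made for illustration.

```javascript
// Toy model of the guess above; NOT HTML Tidy's real algorithm.
// Counts "<script" openers and "</script" / "<\/script" closers anywhere in
// the text (strings and comments included) and reports whether the count
// returns to zero before the input ends.
function scriptClosesInToyModel(html) {
  const token = /<script\b|<\\?\/script\b/gi;
  let depth = 0;
  let opened = false;
  for (const match of html.matchAll(token)) {
    if (match[0].toLowerCase() === '<script') {
      depth += 1;
      opened = true;
    } else if (depth > 0) {
      depth -= 1;
    }
    if (opened && depth === 0) return true; // every opener has been matched
  }
  return false; // input ended with an opener still unmatched
}

// String.raw keeps the backslash in "<\/script>" exactly as written.
const original = String.raw`<html>
<script>
var t='<script><'+'/script>';
</script>
</html>`;

const withComment = String.raw`<script>
var t='<script><'+'/script>'; //'<\/script>'
</script>`;

const withEscape = String.raw`<script>
var t='<script><\/script>';
</script>`;

console.log(scriptClosesInToyModel(original));    // false
console.log(scriptClosesInToyModel(withComment)); // true
console.log(scriptClosesInToyModel(withEscape));  // true
```

In this model the original input ends with one opener unmatched, which is consistent with Tidy appending its own closing tags, while the extra <\/script> in the comment brings the count back to zero.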

Archy Will He 何魏奇 answered Nov 04 '22 21:11


There's no need for string concatenation to avoid the closing </script>. Simply escaping the / character is enough to "fool" the parsers in browsers and, it seems, HTML Tidy's parser as well:

<html>
<script>
var t='<script><\/script>';
</script>
</html>
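
As a quick check that the escaped form changes nothing at runtime, the snippet below compares the two literals. It is a minimal sketch meant to be run outside an HTML page (for example in Node.js); the variable names are made up for illustration.

```javascript
// "\/" in a JavaScript string literal is just "/", so both literals
// produce the same runtime string.
const concatenated = '<script><' + '/script>';
const escaped = '<script><\/script>';

console.log(concatenated === escaped); // true
console.log(escaped);                  // <script></script>

// The source text of the escaped form never contains the character
// sequence "</script", which is what an HTML parser looks for.
const sourceText = String.raw`var t='<script><\/script>';`;
console.log(sourceText.includes('</script')); // false
```

The escape only affects how the source text looks to an HTML parser; the value assigned to t is the same in both versions.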

Nelson Menezes answered Nov 04 '22 23:11