
PHP DOMDocument: Errors while parsing unescaped strings

I'm having an issue while parsing HTML with PHP's DOMDocument.

The HTML I'm parsing has the following script tag:

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

This snippet has two problems:

1) The HTML inside the buttonWithCountTemplate var is not escaped. DOMDocument manages this correctly, escaping the characters when parsing it. Not a problem.

2) Near the end, there's an img tag with an unescaped closing tag:

<img src="$iconImg" />

The /> makes DOMDocument think that the script is finished even though the closing script tag has not been reached. If you extract the script using getElementsByTagName, you'll get an element that ends at this img tag, and the rest will appear as text in the HTML.

My goal is to remove all scripts in this page, so if I do a removeChild() over this tag, the tag is removed but the following part appears as text when rendering the page:

</div><div class="sCountBox">$count</div></a></div>',
        }
    </script>

Fixing the HTML is not a solution because I'm developing a generic parser that needs to handle all types of HTML.

My question is whether I should sanitize the HTML before feeding it to DOMDocument, whether there's an option on DOMDocument that avoids this issue, or whether I can strip all tags before loading the HTML.

Any ideas?


EDIT

After some research, I found out the real problem with the DOMDocument parser. Consider the following HTML:

<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>

Using the following PHP code to remove script tags (based on Gholizadeh's answer):

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('js.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

while($nodes = $dom->getElementsByTagName("script")) {
    if($nodes->length == 0) break;
    $script = $nodes->item(0);
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$final = $dom->saveHTML();
echo $final;

The result will be the following:

<div> <!-- Offending div without closing tag -->
<p>';
       // I should not appear on the result
</p></div>

The problem is that the first div tag is not closed, and it seems that DOMDocument treats the div tags inside the JS string as HTML instead of as part of a plain JS string.

What can I do to solve this? Remember that modifying the HTML is not an option, since I'm developing a generic parser.

Andres asked Nov 20 '16 11:11


3 Answers

I tested the following code on an HTML file like this:

<p>some text 1</p>
<img src="http://www.example.com/images/some_image_1.jpg">
<p>some text 2</p>
<p>some text 3</p>
<img src="http://www.example.com/images/some_image_2.jpg">

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

<p>some text 4</p>
<p>some text 5</p>
<img src="http://www.example.com/images/some_image_3.jpg">

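The PHP code is:

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
// @ silences warnings about malformed markup; the flags stop libxml from
// adding implied <html>/<body> elements and a default doctype.
@$dom->loadHTML(file_get_contents('script.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

// getElementsByTagName() returns a live list, so copy it to an array first;
// removing nodes while iterating the live list can skip elements.
$nodes = iterator_to_array($dom->getElementsByTagName("script"));

foreach ($nodes as $script) {
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$dom->saveHTMLFile('script.html');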

It works on the given example. I think you should use the same options I used when loading the HTML.

Edited according to last question updates:

Strictly speaking, you can't parse [X]HTML with regex (read this link for more information), but if your only goal is to remove script tags and you can be sure that no </script> appears as a string inside them, you can use this regex:

$html = mb_convert_encoding(file_get_contents('script2.html'), 'HTML-ENTITIES', 'UTF-8');
$new_html = preg_replace('/<script(.*?)>(.*?)<\/script>/si', '', $html);
file_put_contents('script-result.html', $new_html);

Frankly, the problem is that your HTML may not be standard. I think it's better to try the other libraries linked here.

Otherwise, I guess you should write a special parser that removes script tags and takes care of single and double quotes inside them.
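A rough sketch of what such a special-purpose scanner might look like, assuming that a </script> sequence inside a single- or double-quoted JavaScript string should not end the element. The function name stripScriptElements is hypothetical. The sketch ignores template literals, regex literals and block comments, treats an unterminated string as ending at the line break, and drops the rest of the document if a script is never closed.

```php
<?php
// Hypothetical helper: remove <script>...</script> elements from raw HTML,
// skipping over quoted JavaScript strings so that tags inside them are ignored.
function stripScriptElements(string $html): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    // Find each opening <script ...> tag (case-insensitive), starting at $pos.
    while (preg_match('~<script\b[^>]*>~i', $html, $m, PREG_OFFSET_CAPTURE, $pos)) {
        $start = $m[0][1];
        $out .= substr($html, $pos, $start - $pos); // keep everything before the tag
        $i = $start + strlen($m[0][0]);             // first byte of the script body
        $quote = null;                              // current JS string delimiter, if any

        while ($i < $len) {
            $ch = $html[$i];
            if ($quote !== null) {
                if ($ch === '\\') {
                    $i++;              // skip the escaped character
                } elseif ($ch === $quote || $ch === "\n") {
                    $quote = null;     // string closed, or left unterminated at end of line
                }
            } elseif ($ch === '"' || $ch === "'") {
                $quote = $ch;          // entering a string literal
            } elseif ($ch === '<' && strncasecmp(substr($html, $i, 9), '</script>', 9) === 0) {
                $i += 9;               // consume the closing tag and stop scanning this script
                break;
            }
            $i++;
        }
        $pos = min($i, $len);
    }

    // Append whatever follows the last script element.
    return $out . substr($html, $pos);
}

$sample = "<div>\n<script type=\"text/javascript\">\n  var test = '</div>';\n"
    . "  var t2 = '</script>';\n  // I should not appear on the result\n</script>\n<p>kept</p>";
$result = stripScriptElements($sample);
```

Note that browsers end a script element at the first </script> regardless of quoting, so this deliberately differs from how an HTML5 parser behaves; whether that is desirable depends on the input you expect.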

Saeed.Gh answered Oct 17 '22 13:10


I am offering a different approach to your problem:

My goal is to remove all scripts in this page

Then you can remove them with the preg_replace_callback function and parse the HTML as DOM after that. Here is a working demo:

$htmlWithScript = "<html><body><div>something></div><script type=\"text/javascript\">
var showShareBarUI_params_e81 =
{
    buttonWithCountTemplate: '<div class=\"sBtnWrap\"><a href=\"#\" onclick=\"\$onClick\"><div class=\"sBtn\">\$text<img src=\"\$iconImg\" /></div><div class=\"sCountBox\">\$count</div></a></div>',
}
</script></body></html>";



$htmlWithoutScript = preg_replace_callback('~<script.*>.*</script>~Uis', function($matches){
return '';
}, $htmlWithScript);
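The second half of this approach, handing the stripped markup to DOMDocument, could look roughly like the sketch below. It uses plain preg_replace with the same pattern, which should be equivalent here because the callback always returns an empty string. The sample markup and variable names are illustrative only.

```php
<?php
// Illustrative input: a script whose string literal contains markup.
$htmlWithScript = '<html><body><div>something</div><script type="text/javascript">'
    . "var tpl = '<div class=\"sBtn\"><img src=\"x.png\" /></div><div>\$count</div>';"
    . '</script><p>after</p></body></html>';

// Step 1: drop every <script>...</script> block before any DOM parsing happens.
$htmlWithoutScript = preg_replace('~<script.*>.*</script>~Uis', '', $htmlWithScript);

// Step 2: parse what is left; buffer libxml warnings instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($htmlWithoutScript);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The script element and the markup from its string literal are both gone.
$scriptCount = $dom->getElementsByTagName('script')->length;
$text = $dom->getElementsByTagName('body')->item(0)->textContent;
echo $scriptCount, "\n", $text, "\n";
```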

EDIT

But how do I do this without summoning Cthulhu?

Nice comment, but I don't know what you are asking :) If it is about loading the HTML, you can load it with file_get_contents().

If you do not understand how it removes the tags: preg_replace_callback lets you search for matches of a regexp and transform them; in this case it removes them (return '';). The regexp looks for an opening <script tag with any attributes (.*), followed by any content up to the closing </script> tag.

Modifiers:

U -> means ungreedy (shortest match possible)

i -> case insensitive (<SCRIPT> will be matched as well)

s -> newlines are included in the . (dot) character (a newline will not break the match)

I hope this clarifies it a bit..
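The effect of each modifier can be checked in isolation by dropping one at a time. A small sketch, using a made-up sample string:

```php
<?php
$sample = "<p>a</p><SCRIPT>\nvar x = 1;\n</SCRIPT><p>b</p><script>var y = 2;</script><p>c</p>";

// All three modifiers: each script block is removed on its own.
$all = preg_replace('~<script.*>.*</script>~Uis', '', $sample);

// Without U the match is greedy: it runs from the first opening tag
// to the last closing tag and swallows <p>b</p> along the way.
$greedy = preg_replace('~<script.*>.*</script>~is', '', $sample);

// Without i the upper-case <SCRIPT> block is left untouched.
$caseSensitive = preg_replace('~<script.*>.*</script>~Us', '', $sample);

// Without s the dot stops at newlines, so the multi-line block survives.
$noDotAll = preg_replace('~<script.*>.*</script>~Ui', '', $sample);

var_dump($all, $greedy, $caseSensitive, $noDotAll);
```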

Jimmmy answered Oct 17 '22 14:10


Have you tried setting libxml to use internal errors?

$use_errors = libxml_use_internal_errors(true);
// your parsing code here
libxml_clear_errors();
libxml_use_internal_errors($use_errors);

It might allow DOMDocument to continue parsing (maybe).
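To see what the parser actually complained about, the buffered errors can be read back with libxml_get_errors() before they are cleared. A minimal sketch, using a made-up malformed snippet; which messages appear, if any, depends on the libxml2 version.

```php
<?php
// Illustrative input: an unclosed div followed by a script containing "</div>".
$html = '<div><script>var test = "</div>";</script>';

// Buffer libxml errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Each entry is a LibXMLError object with line, column, code and message fields.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                  // empty the buffer
libxml_use_internal_errors($previous);  // restore the earlier setting
```

As far as I know, this setting only changes how errors are reported, not how the markup is parsed, so the resulting tree should be the same either way.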

Tschallacka answered Oct 17 '22 15:10