
What is the correct way to detect whether string inputs contain HTML or not?

When receiving user input on forms, I want to check that fields like "username" or "address" do not contain markup that has a special meaning in XML (RSS feeds) or (X)HTML (when displayed).

So which of these is the correct way to detect whether the input entered doesn't contain any special characters in HTML and XML context?

if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE) 

or

if (htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') === $data) 

or

if (preg_match("/[^\p{L}\-.']/u", $text)) // problem: also catches symbols 
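
To see how the three candidate checks differ in practice, they can be run side by side. This is only a sketch: the wrapper names (`no_angle_brackets`, `unchanged_by_escaping`, `only_name_chars`) and the sample inputs are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical wrappers around the three checks above. Each returns true
// when the input is considered free of special characters.
function no_angle_brackets(string $data): bool
{
    return mb_strpos($data, '<') === false && mb_strpos($data, '>') === false;
}

function unchanged_by_escaping(string $data): bool
{
    // Also rejects "&", because htmlspecialchars() rewrites it to "&amp;".
    return htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') === $data;
}

function only_name_chars(string $data): bool
{
    // Whitelist: letters, hyphen, dot and apostrophe only.
    // preg_match() returns 0 when no disallowed character is found.
    return preg_match("/[^\p{L}\-.']/u", $data) === 0;
}

// Made-up sample inputs.
$samples = ["O'Neil", 'Tom & Jerry', '<b>bold</b>', '12 Main St.'];

foreach ($samples as $sample) {
    printf(
        "%-12s brackets:%s escaping:%s whitelist:%s\n",
        $sample,
        var_export(no_angle_brackets($sample), true),
        var_export(unchanged_by_escaping($sample), true),
        var_export(only_name_chars($sample), true)
    );
}
```

With these samples, the bracket check only rejects `<b>bold</b>`, the escaping check additionally rejects `Tom & Jerry` because of the ampersand, and the whitelist rejects everything except `O'Neil` because spaces and digits are not in the allowed set.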

Have I missed anything else, like byte sequences or other tricky ways to get markup tags around things like "javascript:"? As far as I'm aware, all XSS and CSRF attacks require < or > around the values to get the browser to execute the code (well, at least from Internet Explorer 6 or later anyway) - is this correct?

I am not looking for something to reduce or filter input. I just want to locate dangerous character sequences when used in XML or HTML context. (strip_tags() is horribly unsafe. As the manual says, it doesn't check for malformed HTML.)

Update

I think I need to clarify that there are a lot of people mistaking this question for a question about basic security via "escaping" or "filtering" dangerous characters. This is not that question, and most of the simple answers given wouldn't solve that problem anyway.

Update 2: Example

  • User submits input
  • if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)
  • I save it

Now that the data is in my application I do two things with it: 1) display it in a format like HTML, or 2) display it inside a form element for editing.

The first one is safe in XML and HTML context

<h2><?php print $input; ?></h2>
<xml><item><?php print $input; ?></item></xml>

The second form is more dangerous, but it should still be safe:

<input value="<?php print htmlspecialchars($input, ENT_QUOTES, 'UTF-8');?>">
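
One case worth checking against the `<`/`>` test is a value that contains quotes but no angle brackets. The sketch below uses a made-up payload to show that such a value passes the bracket check, and that the `htmlspecialchars()` call with `ENT_QUOTES` used in the form field above is what keeps it inside the attribute. The variable names are illustrative only.

```php
<?php
// Made-up input: no "<" or ">", but it contains double quotes.
$input = '" onfocus="alert(1)';

// The bracket check from the question accepts this value.
$passes = mb_strpos($input, '<') === false && mb_strpos($input, '>') === false;
var_dump($passes); // bool(true)

// Escaping with ENT_QUOTES turns the quotes into entities, so the value
// cannot close the attribute early.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo '<input value="' . $escaped . '">', "\n";
// prints: <input value="&quot; onfocus=&quot;alert(1)">
```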

Update 3: Working Code

You can download the gist I created and run the code as a text or HTML response to see what I'm talking about. This simple check passes the http://ha.ckers.org XSS Cheat Sheet, and I can't find anything that makes it through. (I'm ignoring Internet Explorer 6 and below.)

I started another bounty to award someone that can show a problem with this approach or a weakness in its implementation.

Update 4: Ask a DOM

It's the DOM that we want to protect - so why not just ask it? Timur's answer led to this:

function not_markup($string) {
    libxml_use_internal_errors(true);
    if ($xml = simplexml_load_string("<root>$string</root>"))
    {
        return $xml->children()->count() === 0;
    }
}

if (not_markup($_POST['title'])) ...
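
A stricter variant of the same idea can be sketched as follows. It is an illustration under stated assumptions rather than the code from the gist: the name `is_plain_text` is hypothetical, it compares the parser result against `false` explicitly instead of relying on the truthiness of the returned object, and it always returns a boolean.

```php
<?php
// Hypothetical helper: true when the string, wrapped in a root element,
// parses as well-formed XML and adds no child elements.
function is_plain_text(string $string): bool
{
    // Collect parse errors internally instead of emitting warnings.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string("<root>$string</root>");
    libxml_clear_errors();

    if ($xml === false) {
        // Not well-formed once wrapped, so treat it as markup.
        return false;
    }

    // children() lists element children only; plain text yields none.
    return $xml->children()->count() === 0;
}

var_dump(is_plain_text('This is a safe text')); // bool(true)
var_dump(is_plain_text('<b>bold</b>'));         // bool(false): one child element
var_dump(is_plain_text('a < b'));               // bool(false): not well-formed
var_dump(is_plain_text('Tom & Jerry'));         // bool(false): bare ampersand
```

The last case shows a side effect of asking an XML parser: a bare `&` is rejected as well, because it is not well-formed XML.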
Xeoncross asked Dec 07 '11 16:12



1 Answer

I don't think you need to implement a huge algorithm to check whether a string contains unsafe data; filters and regular expressions do the job. But if you need a more thorough check, maybe this will fit your needs:

<?php
// Sample inputs: XSS vectors plus one harmless string.
$strings = array();
$strings[] = <<<EOD
    ';alert(String.fromCharCode(88,83,83))//\';alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//\";alert(String.fromCharCode(88,83,83))//--></SCRIPT>">'><SCRIPT>alert(String.fromCharCode(88,83,83))</SCRIPT>
EOD;
$strings[] = <<<EOD
    '';!--"<XSS>=&{()}
EOD;
$strings[] = <<<EOD
    <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
EOD;
$strings[] = <<<EOD
    This is a safe text
EOD;
$strings[] = <<<EOD
    <IMG SRC="javascript:alert('XSS');">
EOD;
$strings[] = <<<EOD
    <IMG SRC=javascript:alert('XSS')>
EOD;
$strings[] = <<<EOD
    <IMG SRC=&#106;&#97;&#118;&#97;&#115;&#99;&#114;&#105;&#112;&#116;&#58;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#88;&#83;&#83;&#39;&#41;>
EOD;
$strings[] = <<<EOD
    perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out
EOD;
$strings[] = <<<EOD
    <SCRIPT/XSS SRC="http://ha.ckers.org/xss.js"></SCRIPT>
EOD;
$strings[] = <<<EOD
    </TITLE><SCRIPT>alert("XSS");</SCRIPT>
EOD;

// Collect libxml parse errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

// Baseline: a known-good document tells us how many children to expect.
$sourceXML = '<root><element>value</element></root>';
$sourceXMLDocument = simplexml_load_string($sourceXML);
$sourceCount = $sourceXMLDocument->children()->count();

foreach( $strings as $string ){
    $unsafe = false;
    $XML = '<root><element>'.$string.'</element></root>';
    $XMLDocument = simplexml_load_string($XML);
    if( $XMLDocument===false ){
        // The input broke well-formedness.
        $unsafe = true;
    }else{
        // The input changed the number of children under <root>.
        $count = $XMLDocument->children()->count();
        if( $count!=$sourceCount ){
            $unsafe = true;
        }
    }

    echo ($unsafe?'Unsafe':'Safe').': <pre>'.htmlspecialchars($string,ENT_QUOTES,'utf-8').'</pre><br />'."\n";
}
?>
Timur answered Oct 13 '22 18:10