I've been hunting around the net now for a few days trying to figure this out but getting conflicting answers.
Is there a library, class or function for PHP that securely sanitizes/encodes a string against XSS? It needs to be updated regularly to counter new attacks.
I have a few use cases:
Use case 1) I have a plain text field, say for a First Name or Last Name.
I'm thinking I could just do trim() and strip_tags(), then use a Sanitize Filter or RegEx with a whitelist of characters. Do they really need characters like ! and ? or < > in their name? Not really.
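A minimal sketch of the whitelist idea from use case 1. The function name cleanName, the allowed character set, and the 100-character limit are assumptions made for illustration, not an established rule for names. Whitelisting on input only restricts what gets stored; it does not replace encoding the value when it is output.

```php
<?php
// Hypothetical helper: returns the trimmed name, or null if it fails the whitelist.
function cleanName(string $raw): ?string
{
    $name = trim($raw);

    // Allow Unicode letters and combining marks, plus space, apostrophe,
    // period and hyphen. The 1-100 length limit is an arbitrary assumption.
    // With the /u modifier, invalid UTF-8 makes preg_match() return false,
    // which is rejected here as well.
    if (preg_match('/^[\p{L}\p{M}\' .-]{1,100}$/u', $name) !== 1) {
        return null;
    }

    return $name;
}

var_dump(cleanName("  Zoë O'Neill-Smith "));  // accepted, surrounding spaces removed
var_dump(cleanName('<b>foo</b>'));            // rejected: < > / are not whitelisted
```

With a whitelist like this, strip_tags() becomes redundant, since angle brackets never pass the check.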
Use case 2) When outputting the contents from a previously saved database record (or from a previously submitted form) to the View/HTML I want to thoroughly clean it for XSS. NB: It may or may not have gone through the filtering step in use case 1 as it could be a different type of input, so assume no sanitizing has been done.
Initially I thought HTMLPurifier would do the job, but it seems it is not what I need, judging by the reply I got when I posed the question to their support:
Here's the litmus test: if a user submits <b>foo</b>, should it show up as <b>foo</b> or as a bold foo? If the former, you don't need HTML Purifier.
So I'd rather it showed up as <b>foo</b>, because I don't want any HTML rendered for a simple text field or any JavaScript executing.
So I've been hunting around for a function that will do it all for me. I stumbled across the xss_clean method used by Kohana 3.0, which I'm guessing works, but it's only meant for when you want to keep the HTML. It has been deprecated since Kohana 3.1, as they've replaced it with HTMLPurifier. So I'm guessing you're supposed to use HTML::chars() instead, which only does this:
public static function chars($value, $double_encode = TRUE)
{
    return htmlspecialchars((string) $value, ENT_QUOTES, Kohana::$charset, $double_encode);
}
Now apparently you're supposed to use htmlentities instead, as mentioned in quite a few places on Stack Overflow, because it's supposedly more secure than htmlspecialchars.
Now I see that the 3rd parameter for the htmlentities method is the charset to be used in conversion. Now my site/db is in UTF-8, but perhaps the form submitted data was not UTF-8 encoded, maybe they submitted ASCII or HEX so maybe I need to convert it to UTF-8 first? That would mean some code like:
$encoding = mb_detect_encoding($input);
$input = mb_convert_encoding($input, 'UTF-8', $encoding);
$input = htmlentities($input, ENT_QUOTES, 'UTF-8');
Yes or no? Then I'm still not sure how to protect against the hex, decimal and base64 possible XSS inputs...
If there's some library or open source PHP framework that can do XSS protection properly I'd be interested to see how they do it in code.
Any help much appreciated, sorry for the long post!
To answer the bold question: Yes, there is. It's called htmlspecialchars.
It needs to be updated regularly to counter new attacks.
The right way to prevent XSS attacks is not to counter specific attacks or to filter/sanitize data, but to encode properly, everywhere.
htmlspecialchars (or htmlentities), in conjunction with a reasonable choice of character encoding (i.e. UTF-8) and explicit specification of that encoding, is sufficient to prevent XSS when the value is placed in HTML text or in a quoted attribute value. Fortunately, calling htmlspecialchars without an explicit encoding happens to work out for UTF-8, too: before PHP 5.4 it assumed ISO-8859-1, which leaves the bytes of multi-byte UTF-8 characters untouched, and since PHP 5.4 the default is UTF-8. If you want to make that explicit, create a helper function:
// Don't forget to specify UTF-8 as the document's encoding
function htmlEncode($s) {
return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}
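As a rough usage sketch of the helper, here is what it does to the <b>foo</b> example from the question and to an attempt to break out of a quoted attribute. The variable names and the surrounding markup are made up for illustration.

```php
<?php
// The helper from the answer, repeated so this snippet runs on its own.
function htmlEncode($s) {
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

$name = '<b>foo</b>';               // untrusted value, e.g. read from the database
$attr = '" onmouseover="alert(1)';  // attempt to break out of an attribute value

// In HTML text: the browser displays the literal characters <b>foo</b>.
echo '<p>Hello, ' . htmlEncode($name) . '</p>', "\n";

// In a quoted attribute: the double quotes are encoded, so the value cannot end early.
echo '<input type="text" name="first" value="' . htmlEncode($attr) . '">', "\n";
```

The first line is emitted as &lt;b&gt;foo&lt;/b&gt; inside the paragraph, which is exactly the "show it as text" behaviour asked for in use case 2.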
Oh, and to address the form worries: Don't try to detect encodings, it's bound to fail. Instead, give out the form in UTF-8. Every browser will send user inputs in UTF-8 then.
(...) you're supposed to use htmlentities because htmlspecialchars is vulnerable to UTF-7 XSS exploit.
The UTF-7 XSS exploit can only be applied if the browser thinks a document is encoded in UTF-7. Specifying the document encoding as UTF-8 (in the HTTP header and/or a meta tag right after <head>) prevents this.
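One way this could look in practice is sketched below: the encoding is declared in the HTTP header and in a meta tag, the form asks for UTF-8 submissions, and the untrusted value is encoded on output. The renderPage function and the page markup are hypothetical.

```php
<?php
// Declare the encoding in the HTTP header; this must happen before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical helper: renders a minimal page around one untrusted comment.
function renderPage(string $comment): string
{
    $safe = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

    return '<!DOCTYPE html>' . "\n"
        // The meta tag repeats the declaration right after <head>.
        . '<html><head><meta charset="UTF-8"><title>Comments</title></head>' . "\n"
        . '<body>' . "\n"
        // accept-charset asks the browser to submit the form data as UTF-8.
        . '<form method="post" accept-charset="UTF-8">'
        . '<textarea name="comment"></textarea><button>Send</button></form>' . "\n"
        . '<p>' . $safe . '</p>' . "\n"
        . '</body></html>' . "\n";
}

echo renderPage('<script>alert(1)</script>');
```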
Also if I don't detect the encoding, what's to stop an attacker downloading the html file, then altering it to UTF-7 or some other encoding, then submitting the POST request back to my server from the altered html page?
This attack scenario is unnecessarily complex. The attacker could just craft a UTF-7 string, no need to download anything.
If you accept the attacker's POST (i.e. you're accepting anonymous public user input), your server will just interpret the UTF-7 string as a weird UTF-8 one. That is not a problem; the attacker's post will just show up garbled. The attacker could achieve the same effect (sending strange text) by submitting "grfnlk" a hundred times.
If my method only works for UTF-8 then the XSS attack will get through, no?
No, it won't. Encodings are not magic. An encoding is just a way to interpret a binary string. For example, the string "ö" is encoded as (hexadecimal) 2B 41 50 59 in UTF-7 (and C3 B6 in UTF-8). Decoding 2B 41 50 59 as UTF-8 yields "+APY": harmless, seemingly random characters.
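The byte-level view can be checked with a few lines of PHP. This is only an illustration; the UTF-7 payload string is a commonly cited example shape, not something taken from the question.

```php
<?php
$utf8 = "\xC3\xB6";           // "ö" encoded as UTF-8
$utf7 = hex2bin('2b415059');  // the UTF-7 bytes 2B 41 50 59 quoted above

echo bin2hex($utf8), "\n";    // c3b6
echo $utf7, "\n";             // +APY  (four plain ASCII characters)

// A UTF-7 encoded <script> tag consists only of ASCII letters, digits and
// punctuation, so htmlspecialchars() has nothing to encode and returns it as-is.
$payload = '+ADw-script+AD4-alert(1)+ADw-/script+AD4-';
$encoded = htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');

echo $encoded, "\n";          // identical to $payload
```

The payload passes through unchanged, and that is fine: as long as the page is declared as UTF-8, the browser shows it as the literal text +ADw-script+AD4-... and never decodes it into a tag.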
Also how does htmlentities protect against HEX or other XSS attacks?
Hexadecimal data will be output as just that. An attacker sending "3C" will post a message "3C". "3C" can only become < if you actively interpret hexadecimal input, for example by mapping it to Unicode code points and then outputting those. That just means that if you're accepting data in anything other than plain UTF-8 (for example base32-encoded UTF-8), you'll first have to unpack your encoding and then use htmlspecialchars before including the result in HTML code.
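A small sketch of the "unpack first, then encode" order. PHP has no built-in base32 decoder, so base64 stands in for it here; the decodeThenEncode function is hypothetical.

```php
<?php
// A literal "3C" is just two harmless characters.
echo htmlspecialchars('3C', ENT_QUOTES, 'UTF-8'), "\n";  // 3C

// Hypothetical helper for a field that is deliberately transported base64-encoded:
// unpack the transport encoding first, then HTML-encode the result.
function decodeThenEncode(string $packed): string
{
    $decoded = base64_decode($packed, true);  // strict mode: false on invalid input
    if ($decoded === false) {
        return '';                            // reject anything that is not valid base64
    }
    return htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
}

$packed = base64_encode('<script>alert(1)</script>');
echo decodeThenEncode($packed), "\n";  // &lt;script&gt;alert(1)&lt;/script&gt;
```

Encoding in the other order (htmlspecialchars on the packed string, then decoding) would let the markup through, because the packed form contains nothing for htmlspecialchars to escape.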
Many security engineers recommend using this library for this specific problem:
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API