Let's say I have a web application that uses Latin1 or some other default English-language encoding. I want to change the application to use UTF-8, or perhaps another language's encoding. Can you prove that this change will introduce XSS?
This is not a PHP-specific question, but in PHP can you show a case where htmlspecialchars($var,ENT_QUOTES); is vulnerable to XSS and htmlspecialchars($var,ENT_QUOTES,'UTF-8'); is not?
Here's a silly example that cheats by using htmlspecialchars differently from how you intended.
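<?php
// Escape the same input twice: once with the default charset, once as UTF-8.
$s = htmlspecialchars($_GET['x'], ENT_QUOTES);
$s_utf8 = htmlspecialchars($_GET['x'], ENT_QUOTES, 'UTF-8');
// Deliberately wrong: the escaped value is only tested, the raw value is printed.
if (!empty($s))
  print "default: " . $_GET['x'] . "<br>\n";
if (!empty($s_utf8))
  print "utf8: " . $_GET['x'] . "<br>\n";
?>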
Submit any XSS payload and add an invalid UTF-8 byte, e.g.
http://site/silly.php?x=<script>alert(0)</script>%fe
htmlspecialchars bails on an invalid UTF-8 byte sequence and returns an empty string. Printing the $_GET value is an obvious hole, but I do have a point to make.
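To see that behavior in isolation, here is a minimal sketch. It passes the charset explicitly because, since PHP 5.4, the default charset for htmlspecialchars is UTF-8 rather than ISO-8859-1, so the two calls in the example above may behave identically on a current install. The variable names are made up for illustration, and ENT_SUBSTITUTE also needs PHP 5.4 or later.

```php
<?php
// Hypothetical payload: a script tag followed by 0xFE, a byte that never occurs in valid UTF-8.
$payload = "<script>alert(0)</script>\xFE";

// Treated as ISO-8859-1, every byte is a character: the markup is escaped and 0xFE is kept.
$latin1 = htmlspecialchars($payload, ENT_QUOTES, 'ISO-8859-1');

// Treated as UTF-8 without ENT_SUBSTITUTE or ENT_IGNORE, invalid input yields an empty string.
$utf8 = htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE, the invalid byte is replaced by U+FFFD instead of discarding everything.
$utf8_sub = htmlspecialchars($payload, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($latin1, $utf8, $utf8_sub);
```

In every case the angle brackets are escaped the same way. The only difference is what happens to the output as a whole when the input is not valid in the declared encoding.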
In short, you're going to get byte-by-byte checks with both Latin1 and UTF-8, so I'm not aware of a language-dependent example where htmlspecialchars will miss a dangerous byte in one encoding but not in the other.
The point of my example is that your question was really a more general (and perhaps a bit too vague) one about the dangers of XSS when changing encoding schemes. When content starts dealing with different multi-byte encodings, developers may foul up validation filters based on strchr(), strlen(), or similar checks that aren't multi-byte aware and might be thwarted by a %00 in the payload. (Hey, some devs still hold to using regexes to parse and sanitize HTML.)
In principle, I think the two example lines in the question are equally secure as far as switching encodings goes. In practice, there are still plenty of ways to make other mistakes with ambiguous encoding.
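As one illustration of a check that isn't multi-byte aware, here is a rough sketch of a byte-based length limit applied to UTF-8 text. The input string and the 4-byte limit are assumptions chosen for the demonstration, not anything taken from a real application.

```php
<?php
// "café" encoded as UTF-8: 4 characters, but 5 bytes, because é is C3 A9.
$name = "caf\xC3\xA9";

// strlen() counts bytes, not characters.
$byte_length = strlen($name);

// substr() also works on bytes, so cutting at 4 splits the é and leaves a lone 0xC3.
$truncated = substr($name, 0, 4);

// The truncated value is no longer valid UTF-8, so escaping it as UTF-8
// (without ENT_SUBSTITUTE or ENT_IGNORE) returns an empty string.
$escaped = htmlspecialchars($truncated, ENT_QUOTES, 'UTF-8');

var_dump($byte_length, bin2hex($truncated), $escaped);
```

Nothing here is an XSS hole by itself, but it shows how a filter written for a single-byte encoding can quietly produce invalid data after a switch to UTF-8. The mb_* functions from the mbstring extension, where it is installed, are the usual multi-byte-aware replacements.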
From RFC 3629:
10. Security Considerations
Implementers of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but erroneously allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.
So it's vitally important to ascertain that your data is valid UTF-8.
But once you have done this, security concerns related to the encoding are minimal. All HTML special characters are in ASCII, and UTF-8, like ISO-8859-1, is fully ASCII-compatible. htmlspecialchars will behave the way you expect.
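The validity check itself could be sketched along these lines. The function name is_valid_utf8 is hypothetical. It relies on preg_match() with the /u modifier, which returns false when the subject is not valid UTF-8. If the mbstring extension is available, mb_check_encoding($bytes, 'UTF-8') is an alternative.

```php
<?php
// Hypothetical helper: true only if $bytes is a well-formed UTF-8 string.
// An empty pattern with the /u modifier matches any valid UTF-8 subject;
// preg_match() returns false if the subject contains malformed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// The octet sequences from the RFC excerpt above, plus two well-formed strings.
$plain_traversal   = "/../";          // 2F 2E 2E 2F, ordinary ASCII
$overlong_dot      = "/\xC0\xAE./";   // 2F C0 AE 2E 2F, illegal overlong "."
$overlong_nul      = "\xC0\x80";      // illegal overlong NUL
$accented          = "caf\xC3\xA9";   // well-formed two-byte sequence

var_dump(
    is_valid_utf8($plain_traversal),
    is_valid_utf8($overlong_dot),
    is_valid_utf8($overlong_nul),
    is_valid_utf8($accented)
);
```

Rejecting the overlong forms up front means a later check for "/../" or for NUL only ever sees the canonical encoding.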
There is more of a concern with non-ASCII-compatible encodings. For example, in GB18030, the ASCII bytes 0x30 and above can occur within the encoding of a multi-byte character. The HYPHEN character ‐ (U+2010) is encoded as A9 5C, which includes an ASCII backslash. This makes it more difficult to handle backslash escaping properly, because a byte-oriented escaping routine cannot tell that trail byte from a real backslash, which invites SQL injection.
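A small sketch of why that trail byte is awkward, using the A9 5C byte pair quoted above. It uses addslashes() only as a stand-in for any escaping routine that works on bytes without knowing the encoding. The variable names are illustrative.

```php
<?php
// Bytes of U+2010 in GB18030 as given in the text: lead byte 0xA9, trail byte 0x5C.
$hyphen_gb18030 = "\xA9\x5C";

// At the byte level, the trail byte is indistinguishable from an ASCII backslash.
$has_backslash_byte = strpos($hyphen_gb18030, "\\") !== false;

// addslashes() operates on bytes, so it doubles the trail byte.
// The result A9 5C 5C is no longer the single character that went in.
$escaped_hex = bin2hex(addslashes($hyphen_gb18030));

var_dump($has_backslash_byte, $escaped_hex);
```

This is one reason escaping is usually better left to something that knows the connection's character set, such as a database driver's own escaping function or parameterized queries, rather than a generic byte-wise routine.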