I want to deserialize a JSON file – which represents a RESTful web service response – into the corresponding classes. I was using
```csharp
System.Text.ASCIIEncoding.ASCII.GetBytes(ResponseString)
```
and I read on the Microsoft Docs that using UTF-8 encoding instead of ASCII is better for security reasons.
Now I am a little confused because I don't know the real difference between the two with regard to security. Can anyone show me the practical advantages of using UTF-8 over ASCII for deserialization?
All characters in ASCII can be encoded using UTF-8 without an increase in storage (both require a single byte per character). UTF-8 has the added benefit of supporting characters beyond the ASCII range.
Space efficiency is a key advantage of UTF-8 encoding. If instead every Unicode character were represented by four bytes, a text file written in English would be four times the size of the same file encoded with UTF-8. Another benefit of UTF-8 encoding is its backward compatibility with ASCII.
Why did UTF-8 replace the ASCII character-encoding standard? Because UTF-8 can store a character in more than a single byte, which allows it to represent far more characters, such as emoji.
For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to four bytes, though most Western European characters require only two bytes.
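A quick sketch in C# to verify those byte counts with `System.Text.Encoding`:

```csharp
using System;
using System.Text;

class ByteCounts
{
    static void Main()
    {
        // Pure-ASCII text: UTF-8 produces exactly the same bytes as ASCII.
        string ascii = "hello";
        Console.WriteLine(Encoding.ASCII.GetBytes(ascii).Length); // 5
        Console.WriteLine(Encoding.UTF8.GetBytes(ascii).Length);  // 5

        // Characters outside the ASCII range need more bytes in UTF-8.
        Console.WriteLine(Encoding.UTF8.GetBytes("é").Length);    // 2 (U+00E9)
        Console.WriteLine(Encoding.UTF8.GetBytes("€").Length);    // 3 (U+20AC)
        Console.WriteLine(Encoding.UTF8.GetBytes("😀").Length);   // 4 (U+1F600)
    }
}
```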
If that's the case, why would we ever choose ASCII encoding over UTF-8?
In UTF-16, the encoded file size is nearly twice that of UTF-8 when encoding ASCII characters, so UTF-8 is more efficient because it requires less space. UTF-16 is also not backward compatible with ASCII, whereas UTF-8 is.
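You can verify the size difference with a small sketch; in .NET, UTF-16 is exposed as `Encoding.Unicode`:

```csharp
using System;
using System.Text;

class Utf8VsUtf16
{
    static void Main()
    {
        string ascii = "plain ASCII text";

        // UTF-16 (Encoding.Unicode in .NET) uses 2 bytes per ASCII character...
        Console.WriteLine(Encoding.Unicode.GetBytes(ascii).Length); // 32
        // ...while UTF-8 uses 1, so the UTF-16 output is twice the size.
        Console.WriteLine(Encoding.UTF8.GetBytes(ascii).Length);    // 16
    }
}
```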
The only valid criticism of UTF-8 is that encodings for common Asian languages are larger than in other encodings. UTF-8 is superior because it is ASCII compatible, so most known and tried string operations do not need adaptation, and because it is Unicode. Anything that isn't Unicode shouldn't even be considered in this day and age.
A charset declaration such as `<meta charset="utf-8">` tells the browser that the HTML file is encoded as UTF-8, so that the browser can translate it back to legible text. As mentioned, UTF-8 is not the only encoding for Unicode characters; there is also UTF-16.
Ultimately, the point of an encoding is to get back the data you put in. ASCII only defines a tiny 7-bit range of values; anything above that isn't handled, and you could get back garbage, or `?`s, from payloads that include e̵v̷e̴n̸ ̷r̵e̸m̵o̸t̸e̵l̶y̸ ̶i̴n̴t̵e̵r̷e̵s̶t̶i̷n̷g̵ ̶t̸e̵x̵t̵.
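For example, here is what .NET's ASCII encoder does with non-ASCII input (the name is just an illustration):

```csharp
using System;
using System.Text;

class AsciiDataLoss
{
    static void Main()
    {
        string name = "Müller";

        // Encoding.ASCII silently replaces anything above U+007F with '?'.
        byte[] asciiBytes = Encoding.ASCII.GetBytes(name);
        Console.WriteLine(Encoding.ASCII.GetString(asciiBytes)); // "M?ller" - data is gone

        // UTF-8 round-trips the original string intact.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(name);
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));   // "Müller"
    }
}
```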
Now, what happens when your application gets data it can't handle? We don't know, and that could quite possibly cause a security problem.
It is also just frankly embarrassing in this connected world if you can't correctly store and display your customers' names (or if you print their names backwards because of right-to-left markers). Most people in the world use characters outside of ASCII on a daily basis.
Since UTF-8 is a superset of ASCII, and UTF-8 basically won the encoding war, you might as well just use UTF-8.
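Applied back to the question, here is a minimal sketch assuming `System.Text.Json` (the `Response` class is hypothetical, standing in for your actual response types):

```csharp
using System;
using System.Text;
using System.Text.Json;

class Response
{
    public string Name { get; set; }
}

class Program
{
    static void Main()
    {
        string responseString = "{\"Name\":\"José\"}";

        // Encode with UTF-8, not ASCII, so non-ASCII payload characters survive.
        byte[] utf8 = Encoding.UTF8.GetBytes(responseString);
        Response r = JsonSerializer.Deserialize<Response>(utf8);
        Console.WriteLine(r.Name); // "José" - via Encoding.ASCII it would be "Jos?"
    }
}
```

If you already have the response as a `string`, `JsonSerializer.Deserialize<Response>(responseString)` accepts it directly, which sidesteps the manual `GetBytes` call entirely.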
Since not every sequence of bytes is a valid encoded string, vulnerabilities arise from unwanted transformations, which clever attackers can exploit.
Let me cite from a Black Hat whitepaper on Unicode security:
Character encodings and the Unicode standard are also exposed to vulnerability. ... often they’re related to implementation in practical use. ... the following categories can enable vulnerability in applications which are not built to prevent the relevant attacks:
- Visual Spoofing
- Best-fit mappings
- Charset transcodings and character mappings
- Normalization
- Canonicalization of overlong UTF-8
- Over-consumption
- Character substitution
- Character deletion
- Casing
- Buffer overflows
- Controlling Syntax
- Charset mismatches
Consider the following ... example. In the case of U+017F LATIN SMALL LETTER LONG S, the upper casing and normalization operations transform the character into a completely different value. In some situations, this behavior could be exploited to create cross-site scripting or other attack scenarios.
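To see that long-s behavior concretely, here is a minimal C# sketch; the `Contains` check stands in for a naive blocklist filter:

```csharp
using System;

class LongS
{
    static void Main()
    {
        // U+017F LATIN SMALL LETTER LONG S is not the letter 's', so a
        // naive blocklist check passes the payload through:
        string payload = "\u017Fcript"; // "ſcript"
        Console.WriteLine(payload.Contains("script")); // False

        // ...but upper casing maps U+017F to plain 'S'.
        Console.WriteLine(payload.ToUpperInvariant()); // "SCRIPT"
    }
}
```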
... software vulnerabilities arise when best-fit mappings occur. To name a few:
- Best-fit mappings are not reversible, so data is irrevocably lost.
- Characters can be manipulated to bypass string-handling filters, such as cross-site scripting (XSS) filters, WAFs, and IDS devices.
- Characters can be manipulated to abuse logic in software, such as when characters can be used to access files on the file system. In this case, a best-fit mapping to character sequences such as ../ or file:// could be damaging.
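A sketch of a best-fit mapping in .NET: this assumes best-fit is the default encoder fallback for code-page encodings (which the .NET docs describe for Windows code pages), and the exact mappings depend on the code page's table:

```csharp
using System;
using System.Text;

class BestFit
{
    static void Main()
    {
        // On .NET Core / .NET 5+, legacy code pages require the
        // System.Text.Encoding.CodePages package and this registration.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Ā (U+0100) is not in Windows-1252; with the default best-fit
        // fallback it is silently mapped to plain 'A'.
        Encoding cp1252 = Encoding.GetEncoding(1252);
        byte[] bytes = cp1252.GetBytes("\u0100");
        Console.WriteLine((char)bytes[0]); // 'A'

        // To fail loudly instead of silently transforming, use an
        // exception fallback.
        Encoding strict = Encoding.GetEncoding(1252,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        try { strict.GetBytes("\u0100"); }
        catch (EncoderFallbackException) { Console.WriteLine("rejected"); }
    }
}
```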
If you are actually storing binary data, consider base64 or hex instead.
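For example, a minimal round trip with `Convert.ToBase64String`:

```csharp
using System;

class BinaryAsText
{
    static void Main()
    {
        byte[] binary = { 0x00, 0xFF, 0x10, 0x80 };

        // Base64 survives any text pipeline; decoding restores the exact bytes.
        string b64 = Convert.ToBase64String(binary);
        Console.WriteLine(b64);                          // "AP8QgA=="
        byte[] roundTrip = Convert.FromBase64String(b64);
        Console.WriteLine(roundTrip.Length);             // 4, byte-for-byte identical
    }
}
```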