
Why is using UTF-8 for encoding during deserialization better than ASCII?

I want to deserialize a JSON file – which represents a RESTful web service response – into the corresponding classes. I was using System.Text.ASCIIEncoding.ASCII.GetBytes(ResponseString), and I read in the Microsoft Docs that using UTF-8 encoding instead of ASCII is better for security reasons.

Now I am a little confused because I don't know the real difference between these two (regarding the security thing). Can anyone show me what the real practical advantages of using UTF-8 over ASCII for deserialization are?
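
For reference, here is roughly what that switch looks like in code. This is only a minimal sketch, assuming System.Text.Json and a made-up ResponseDto class standing in for the real response classes:

    using System.Text;
    using System.Text.Json;

    public class ResponseDto               // hypothetical stand-in for the real response classes
    {
        public string? Name { get; set; }
    }

    public static class ResponseReader
    {
        public static ResponseDto? Deserialize(string responseString)
        {
            // UTF-8 preserves any non-ASCII characters in the response;
            // Encoding.ASCII would silently replace them with '?'.
            byte[] utf8Bytes = Encoding.UTF8.GetBytes(responseString);
            return JsonSerializer.Deserialize<ResponseDto>(utf8Bytes);
        }
    }

JsonSerializer.Deserialize also has an overload that accepts the string directly, which avoids choosing an intermediate byte encoding at all.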

Warios asked Mar 05 '20

People also ask

Which is better, ASCII or UTF-8?

All characters in ASCII can be encoded using UTF-8 without an increase in storage (both require one byte per character). UTF-8 has the added benefit of supporting characters beyond the ASCII range.
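
You can verify this with a couple of byte counts (a small illustrative C# snippet; the strings are arbitrary):

    using System;
    using System.Text;

    class ByteCountDemo
    {
        static void Main()
        {
            // Pure ASCII text takes the same space in both encodings.
            Console.WriteLine(Encoding.ASCII.GetByteCount("hello")); // 5
            Console.WriteLine(Encoding.UTF8.GetByteCount("hello"));  // 5

            // Beyond ASCII, UTF-8 simply uses more bytes instead of failing.
            Console.WriteLine(Encoding.UTF8.GetByteCount("héllo"));  // 6
        }
    }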

What is the advantage of UTF-8?

Spatial efficiency is a key advantage of UTF-8 encoding. If instead every Unicode character were represented by four bytes, a text file written in English would be four times the size of the same file encoded with UTF-8. Another benefit of UTF-8 encoding is its backward compatibility with ASCII.

Why did UTF-8 replace the ASCII character-encoding standard?

UTF-8 replaced the ASCII character-encoding standard because it can store a character in more than a single byte. This allows it to represent far more characters, such as emoji.

Is UTF-8 the same as ASCII?

For characters covered by the 7-bit ASCII character codes, the UTF-8 representation is byte-for-byte identical to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of two to four bytes, though most Western European characters require only two bytes.

Why use UTF-8 instead of ASCII?

UTF-8 has the added benefit of supporting characters beyond the ASCII range. If that's the case, why would we ever choose ASCII encoding over UTF-8?

What is the difference between UTF-8 and UTF-16?

When encoding ASCII characters, a UTF-16 file is nearly twice the size of the same content in UTF-8, so UTF-8 is more space-efficient. UTF-16 is also not backward compatible with ASCII, whereas UTF-8 is.
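
The size difference shows up directly in .NET's GetByteCount (an illustrative sketch; Encoding.Unicode is UTF-16LE):

    using System;
    using System.Text;

    class Utf8VsUtf16Demo
    {
        static void Main()
        {
            string text = "plain ASCII text";
            Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 16 (one byte per character)
            Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 32 (two bytes per character)
        }
    }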

Is there any valid criticism of UTF-8?

The only valid criticism of UTF-8 is that text in common Asian languages is larger than in some other encodings. Otherwise UTF-8 is superior: it is ASCII compatible, so most well-known string operations need no adaptation, and it is Unicode. Anything that isn't Unicode shouldn't even be considered in this day and age.

What does it mean when HTML is encoded as UTF-8?

Declaring the charset in the document (for example with a <meta charset="utf-8"> tag) tells the browser that the HTML file is encoded as UTF-8, so the browser can translate the bytes back into legible text. UTF-8 is not the only encoding for Unicode characters; there is also UTF-16.


2 Answers

Ultimately, the intention of an encoder is to get back the data you were meant to get. ASCII only defines a tiny tiny 7-bit range of values; anything over that isn't handled, and you could get back garbage - or ?, from payloads that include e̵v̷e̴n̸ ̷r̵e̸m̵o̸t̸e̵l̶y̸ ̶i̴n̴t̵e̵r̷e̵s̶t̶i̷n̷g̵ ̶t̸e̵x̵t̵.

Now; what happens when your application gets data it can't handle? We don't know, and it could quite possibly cause a security problem when you receive payloads you can't handle.

It is also just frankly embarrassing in this connected world if you can't correctly store and display the names etc of your customers (or print their name backwards because of right-to-left markers). Most people in the world use things outside of ASCII on a daily basis.

Since UTF-8 is a superset of ASCII, and UTF-8 basically won the encoding war: you might as well just use UTF-8.
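
To make the difference concrete, a small sketch (the name is made up) showing what each encoding does to a non-ASCII customer name:

    using System;
    using System.Text;

    class RoundTripDemo
    {
        static void Main()
        {
            string name = "Müller";

            // ASCII: anything above 0x7F is silently replaced with '?'.
            byte[] asciiBytes = Encoding.ASCII.GetBytes(name);
            Console.WriteLine(Encoding.ASCII.GetString(asciiBytes)); // M?ller

            // UTF-8: the same string round-trips unchanged.
            byte[] utf8Bytes = Encoding.UTF8.GetBytes(name);
            Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));   // Müller
        }
    }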

Marc Gravell answered Oct 22 '22


Since not every sequence of bytes is a valid encoded string, vulnerabilities arise from unwanted transformations, which clever attackers can exploit.
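
One practical consequence: you can make the decoder reject malformed input outright instead of silently substituting characters. A hedged sketch using .NET's UTF8Encoding:

    using System;
    using System.Text;

    class StrictUtf8Demo
    {
        static void Main()
        {
            // throwOnInvalidBytes: true makes the decoder throw instead of
            // quietly substituting U+FFFD for invalid byte sequences.
            var strictUtf8 = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false,
                throwOnInvalidBytes: true);

            byte[] invalid = { 0xC3, 0x28 }; // 0xC3 starts a 2-byte sequence; 0x28 cannot continue it
            try
            {
                strictUtf8.GetString(invalid);
            }
            catch (DecoderFallbackException)
            {
                Console.WriteLine("Rejected malformed UTF-8 input.");
            }
        }
    }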

Let me cite from a Black Hat whitepaper on Unicode security:

Character encodings and the Unicode standard are also exposed to vulnerability. ... often they’re related to implementation in practical use. ... the following categories can enable vulnerability in applications which are not built to prevent the relevant attacks:

  • Visual Spoofing 
  • Best-fit mappings
  • Charset transcodings and character mappings
  • Normalization
  • Canonicalization of overlong UTF-8
  • Over-consumption
  • Character substitution
  • Character deletion
  • Casing
  • Buffer overflows
  • Controlling Syntax
  • Charset mismatches

Consider the following ... example. In the case of U+017F LATIN SMALL LETTER LONG S, the upper-casing and normalization operations transform the character into a completely different value. In some situations, this behavior could be exploited to create cross-site scripting or other attack scenarios.
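
(Not part of the whitepaper, but the long-s behavior is easy to reproduce in C#:)

    using System;

    class LongSDemo
    {
        static void Main()
        {
            string longS = "\u017F"; // LATIN SMALL LETTER LONG S, "ſ"
            // Upper-casing maps it to a plain ASCII 'S', so a filter that runs
            // before the case conversion can be sidestepped.
            Console.WriteLine(longS.ToUpperInvariant() == "S"); // True
        }
    }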

... software vulnerabilities arise when best-fit mappings occur. To name a few:

  • Best-fit mappings are not reversible, so data is irrevocably lost.
  • Characters can be manipulated to bypass string handling filters, such as cross-site scripting (XSS) filters, WAFs, and IDS devices.
  • Characters can be manipulated to abuse logic in software, such as when the characters can be used to access files on the file system. In this case, a best-fit mapping to characters such as ../ or file:// could be damaging.
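
(Again not from the whitepaper: two defensive checks in .NET that relate to the overlong-UTF-8 and best-fit points above. The sample strings are arbitrary.)

    using System;
    using System.Text;

    class EncodingHardeningDemo
    {
        static void Main()
        {
            // 0xC0 0xAF is an overlong encoding of '/'. A conformant decoder must
            // reject it; .NET substitutes U+FFFD rather than producing a slash.
            byte[] overlongSlash = { 0xC0, 0xAF };
            Console.WriteLine(Encoding.UTF8.GetString(overlongSlash).Contains("/")); // False

            // Exception fallbacks stop a legacy encoding from quietly best-fitting
            // or replacing characters it cannot represent.
            Encoding strictAscii = Encoding.GetEncoding(
                "us-ascii",
                new EncoderExceptionFallback(),
                new DecoderExceptionFallback());
            try
            {
                strictAscii.GetBytes("smart \u201Cquotes\u201D");
            }
            catch (EncoderFallbackException)
            {
                Console.WriteLine("Refused to down-convert non-ASCII input.");
            }
        }
    }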

If you are actually storing binary data, consider Base64 or hex instead.
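
For instance (illustrative only):

    using System;

    class BinaryPayloadDemo
    {
        static void Main()
        {
            byte[] payload = { 0x00, 0xFF, 0x10, 0x80 }; // arbitrary binary data
            // Base64 keeps binary data within a 7-bit-safe character set,
            // so no text-encoding step can mangle it.
            string base64 = Convert.ToBase64String(payload);
            Console.WriteLine(base64);                             // AP8QgA==
            byte[] roundTrip = Convert.FromBase64String(base64);
            Console.WriteLine(roundTrip.Length == payload.Length); // True
        }
    }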

wp78de answered Oct 23 '22