I want to deserialize a JSON file – which represents a RESTful web service response – into the corresponding classes. I was using
```csharp
System.Text.ASCIIEncoding.ASCII.GetBytes(ResponseString)
```
and I read on the Microsoft Docs that using UTF-8 encoding instead of ASCII is better for security reasons.
Now I am a little confused because I don't know the real difference between the two with regard to security. Can anyone show me the practical advantages of using UTF-8 over ASCII for deserialization?
All characters in ASCII can be encoded using UTF-8 without an increase in storage (both require a single byte per character). UTF-8 has the added benefit of supporting characters beyond the ASCII range.
Space efficiency is a key advantage of UTF-8 encoding. If instead every Unicode character were represented by four bytes, a text file written in English would be four times the size of the same file encoded with UTF-8. Another benefit of UTF-8 encoding is its backward compatibility with ASCII.
Why did UTF-8 replace the ASCII character-encoding standard? Because UTF-8 can store a character in more than a single byte, which allows it to represent far more characters, such as emoji.
For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to four bytes, though most Western European characters require only two bytes.
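A quick sketch in C# to verify those byte counts with `System.Text.Encoding`:

```csharp
using System;
using System.Text;

class ByteCounts
{
    static void Main()
    {
        // Pure-ASCII text: UTF-8 produces exactly the same bytes as ASCII.
        string ascii = "hello";
        Console.WriteLine(Encoding.ASCII.GetBytes(ascii).Length); // 5
        Console.WriteLine(Encoding.UTF8.GetBytes(ascii).Length);  // 5

        // Characters outside the ASCII range need more bytes in UTF-8.
        Console.WriteLine(Encoding.UTF8.GetBytes("é").Length);    // 2 (U+00E9)
        Console.WriteLine(Encoding.UTF8.GetBytes("€").Length);    // 3 (U+20AC)
        Console.WriteLine(Encoding.UTF8.GetBytes("😀").Length);   // 4 (U+1F600)
    }
}
```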
If that's the case, why would we ever choose ASCII encoding over UTF-8?
In UTF-16, the encoded file size is nearly twice that of UTF-8 when encoding ASCII characters, so UTF-8 is more efficient because it requires less space. UTF-16 is also not backward compatible with ASCII, whereas UTF-8 is.
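You can verify the size difference with a small sketch; in .NET, UTF-16 is exposed as `Encoding.Unicode`:

```csharp
using System;
using System.Text;

class Utf8VsUtf16
{
    static void Main()
    {
        string ascii = "plain ASCII text";

        // UTF-16 (Encoding.Unicode in .NET) uses 2 bytes per ASCII character...
        Console.WriteLine(Encoding.Unicode.GetBytes(ascii).Length); // 32
        // ...while UTF-8 uses 1, so the UTF-16 output is twice the size.
        Console.WriteLine(Encoding.UTF8.GetBytes(ascii).Length);    // 16
    }
}
```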
The only valid criticism of UTF-8 is that encodings for common Asian languages are larger than in other encodings. UTF-8 is superior because it is ASCII compatible, so most known and tried string operations do not need adaptation, and because it is Unicode. Anything that isn't Unicode shouldn't even be considered in this day and age.
A charset declaration such as `<meta charset="utf-8">` tells the browser that the HTML file is encoded as UTF-8, so that the browser can translate it back to legible text. As mentioned, UTF-8 is not the only encoding for Unicode characters; there is also UTF-16.
Ultimately, the point of an encoding is to get back the data you put in. ASCII only defines a tiny 7-bit range of values; anything above that isn't handled, and you could get back garbage, or `?`s, from payloads that include e̵v̷e̴n̸ ̷r̵e̸m̵o̸t̸e̵l̶y̸ ̶i̴n̴t̵e̵r̷e̵s̶t̶i̷n̷g̵ ̶t̸e̵x̵t̵.
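For example, here is what .NET's ASCII encoder does with non-ASCII input (the name is just an illustration):

```csharp
using System;
using System.Text;

class AsciiDataLoss
{
    static void Main()
    {
        string name = "Müller";

        // Encoding.ASCII silently replaces anything above U+007F with '?'.
        byte[] asciiBytes = Encoding.ASCII.GetBytes(name);
        Console.WriteLine(Encoding.ASCII.GetString(asciiBytes)); // "M?ller" - data is gone

        // UTF-8 round-trips the original string intact.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(name);
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));   // "Müller"
    }
}
```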
Now, what happens when your application gets data it can't handle? We don't know, and that could quite possibly cause a security problem.
It is also just frankly embarrassing in this connected world if you can't correctly store and display your customers' names (or if you print their names backwards because of right-to-left markers). Most people in the world use characters outside of ASCII on a daily basis.
Since UTF-8 is a superset of ASCII, and UTF-8 basically won the encoding war, you might as well just use UTF-8.
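Applied back to the question, here is a minimal sketch assuming `System.Text.Json` (the `Response` class is hypothetical, standing in for your actual response types):

```csharp
using System;
using System.Text;
using System.Text.Json;

class Response
{
    public string Name { get; set; }
}

class Program
{
    static void Main()
    {
        string responseString = "{\"Name\":\"José\"}";

        // Encode with UTF-8, not ASCII, so non-ASCII payload characters survive.
        byte[] utf8 = Encoding.UTF8.GetBytes(responseString);
        Response r = JsonSerializer.Deserialize<Response>(utf8);
        Console.WriteLine(r.Name); // "José" - via Encoding.ASCII it would be "Jos?"
    }
}
```

If you already have the response as a `string`, `JsonSerializer.Deserialize<Response>(responseString)` accepts it directly, which sidesteps the manual `GetBytes` call entirely.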
Since not every sequence of bytes is a valid encoded string, vulnerabilities arise from unwanted transformations, which clever attackers can exploit.
Let me cite from a Black Hat whitepaper on Unicode security:
Character encodings and the Unicode standard are also exposed to vulnerability. ... often they’re related to implementation in practical use. ... the following categories can enable vulnerability in applications which are not built to prevent the relevant attacks:
- Visual Spoofing
- Best-fit mappings
- Charset transcodings and character mappings
- Normalization
- Canonicalization of overlong UTF-8
- Over-consumption
- Character substitution
- Character deletion
- Casing
- Buffer overflows
- Controlling Syntax
- Charset mismatches
Consider the following ... example. In the case of U+017F LATIN SMALL LETTER LONG S, the upper casing and normalization operations transform the character into a completely different value. In some situations, this behavior could be exploited to create cross-site scripting or other attack scenarios.
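To see that long-s behavior concretely, here is a minimal C# sketch; the `Contains` check stands in for a naive blocklist filter:

```csharp
using System;

class LongS
{
    static void Main()
    {
        // U+017F LATIN SMALL LETTER LONG S is not the letter 's', so a
        // naive blocklist check passes the payload through:
        string payload = "\u017Fcript"; // "ſcript"
        Console.WriteLine(payload.Contains("script")); // False

        // ...but upper casing maps U+017F to plain 'S'.
        Console.WriteLine(payload.ToUpperInvariant()); // "SCRIPT"
    }
}
```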
... software vulnerabilities arise when best-fit mappings occur. To name a few:
- Best-fit mappings are not reversible, so data is irrevocably lost.
- Characters can be manipulated to bypass string-handling filters, such as cross-site scripting (XSS) filters, WAFs, and IDS devices.
- Characters can be manipulated to abuse logic in software, such as when characters can be used to access files on the file system. In this case, a best-fit mapping to character sequences such as ../ or file:// could be damaging.
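A sketch of a best-fit mapping in .NET: this assumes best-fit is the default encoder fallback for code-page encodings (which the .NET docs describe for Windows code pages), and the exact mappings depend on the code page's table:

```csharp
using System;
using System.Text;

class BestFit
{
    static void Main()
    {
        // On .NET Core / .NET 5+, legacy code pages require the
        // System.Text.Encoding.CodePages package and this registration.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Ā (U+0100) is not in Windows-1252; with the default best-fit
        // fallback it is silently mapped to plain 'A'.
        Encoding cp1252 = Encoding.GetEncoding(1252);
        byte[] bytes = cp1252.GetBytes("\u0100");
        Console.WriteLine((char)bytes[0]); // 'A'

        // To fail loudly instead of silently transforming, use an
        // exception fallback.
        Encoding strict = Encoding.GetEncoding(1252,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        try { strict.GetBytes("\u0100"); }
        catch (EncoderFallbackException) { Console.WriteLine("rejected"); }
    }
}
```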
If you are actually storing binary data, consider base64 or hex instead.
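For example, a minimal round trip with `Convert.ToBase64String`:

```csharp
using System;

class BinaryAsText
{
    static void Main()
    {
        byte[] binary = { 0x00, 0xFF, 0x10, 0x80 };

        // Base64 survives any text pipeline; decoding restores the exact bytes.
        string b64 = Convert.ToBase64String(binary);
        Console.WriteLine(b64);                          // "AP8QgA=="
        byte[] roundTrip = Convert.FromBase64String(b64);
        Console.WriteLine(roundTrip.Length);             // 4, byte-for-byte identical
    }
}
```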