
How to know string encoding in C#

I am getting a string from a third-party program that I don't control, and my code outputs it as HTML. This works fine in English, but text in other languages is displayed incorrectly. For example, accented characters in Spanish come out garbled, and characters in East Asian languages (e.g. Korean) look even worse. I am pretty sure I need to do some encoding work so that all languages display correctly.

My understanding of encoding is fairly poor, so before posting the real question, which I suspect is "How do I encode this to UTF-8 in C#?", I would like to understand the matter better by asking simpler questions first.

My question here is: how do I know which encoding my input string has? In Spanish, a word with an accent looks like this: "AcciÃ³n" instead of "Acción". Is this ANSI, or what am I dealing with?

Thanks a lot in advance!

Gaara asked Mar 08 '26 07:03



1 Answer

I get an accent: "AcciÃ³n"

The presence of the Ã character is a dead giveaway. In 8-bit encodings such as Latin-1 and Windows-1252, the accented capital A characters have character codes 0xC0 and up, and bytes in that range are often the first byte of a two-byte UTF-8 encoded character. The ó glyph is code point U+00F3, and its UTF-8 encoding is the two bytes 0xC3 0xB3, which are the 8-bit character codes for Ã and ³.
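The byte arithmetic above can be checked with a small sketch. The class and method names here (MojibakeDemo, Utf8Hex, MisreadAsLatin1) are hypothetical, and ISO-8859-1 is assumed as the 8-bit encoding because it agrees with Windows-1252 for these two bytes and is available without extra setup.

```csharp
using System;
using System.Linq;
using System.Text;

static class MojibakeDemo
{
    // Returns the UTF-8 bytes of the text as hyphen-separated hex.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    // Encodes the text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1.
    // This reproduces the kind of garbling seen in the question.
    public static string MisreadAsLatin1(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1"); // assumed 8-bit encoding
        return latin1.GetString(Encoding.UTF8.GetBytes(text));
    }

    static void Main()
    {
        // U+00F3 is the letter o with acute accent.
        Console.WriteLine(Utf8Hex("\u00F3")); // C3-B3

        // Print code points rather than glyphs so the console encoding does not matter.
        string misread = MisreadAsLatin1("\u00F3");
        Console.WriteLine(string.Join(" ", misread.Select(c => "U+" + ((int)c).ToString("X4"))));
        // U+00C3 U+00B3
    }
}
```

U+00C3 and U+00B3 are Ã and ³, which matches the garbled output in the question.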

The strings are encoded in UTF-8, but you are reading them with an 8-bit encoding such as Encoding.Default (which on .NET Framework is the system's ANSI code page).
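The cleanest fix is to decode the original bytes as UTF-8 at the point where they are read. If only the already garbled string is available, a sketch of a repair might look like the following. It assumes the text was misread as ISO-8859-1, which maps every byte to a character and so loses nothing; with Windows-1252 or another code page, some bytes may not round-trip. The names Utf8Repair, DecodeUtf8, and RepairMisdecodedUtf8 are hypothetical.

```csharp
using System;
using System.Text;

static class Utf8Repair
{
    // Preferred: decode the raw bytes as UTF-8 in the first place.
    public static string DecodeUtf8(byte[] rawBytes)
    {
        return Encoding.UTF8.GetString(rawBytes);
    }

    // Fallback: undo a wrong 8-bit decode by recovering the original bytes
    // and decoding them again as UTF-8. Assumes the wrong decoder was ISO-8859-1.
    public static string RepairMisdecodedUtf8(string garbled)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] originalBytes = latin1.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    static void Main()
    {
        // "Acci" + U+00C3 + U+00B3 + "n" is the garbled word from the question.
        string repaired = RepairMisdecodedUtf8("Acci\u00C3\u00B3n");
        Console.WriteLine(repaired == "Acci\u00F3n"); // True
    }
}
```

Applying the repair to a string that was decoded correctly would damage it, so it is worth fixing the read step itself whenever the raw bytes are accessible.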

Hans Passant answered Mar 10 '26 05:03
