I have an input string in an unknown encoding, e.g.: "\\U+1043\\U+1072\\U+1073\\U+1072\\U+1088\\U+1080\\U+1090\\U+1085\\U+1086\\U+1089\\U+1090\\U+1100"
And I want to convert it to my default encoding (System.Text.Encoding.Default):
System.Text.Encoding.Default  {System.Text.SBCSCodePageEncoding}
    BodyName           "koi8-r"                                        string
    CodePage           1251                                            int
    DecoderFallback    {System.Text.InternalDecoderBestFitFallback}
    EncoderFallback    {System.Text.InternalEncoderBestFitFallback}
    EncodingName       "Cyrillic (Windows)"                            string
    HeaderName         "windows-1251"                                  string
    IsBrowserDisplay   true                                            bool
    IsBrowserSave      true                                            bool
    IsMailNewsDisplay  true                                            bool
    IsMailNewsSave     true                                            bool
    IsReadOnly         true                                            bool
    IsSingleByte       true                                            bool
    WebName            "windows-1251"                                  string
    WindowsCodePage    1251                                            int
How can I determine the encoding of the input, and how do I convert it?
Different computers can use different encodings as the default, and the default encoding can change on a single computer. If you use the Default encoding to encode and decode data streamed between computers or retrieved at different times on the same computer, it may translate that data incorrectly.
Your computer translates numeric values into visible characters. It does this by using an encoding standard. An encoding standard is a numbering scheme that assigns each text character in a character set to a numeric value. A character set can include alphabetical characters, numbers, and other symbols.
All strings in a .NET Framework program are stored as 16-bit Unicode characters. At times you might need to convert from Unicode to some other character encoding, or from some other character encoding to Unicode.
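For illustration, here is a minimal sketch of such a round trip through a byte encoding (the example string is hypothetical):

using System;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        string text = "Габаритность";

        // Encode the UTF-16 string into bytes. On .NET Framework,
        // Encoding.Default is the machine's ANSI code page
        // (windows-1251 in the question); on .NET Core / .NET 5+ it is UTF-8.
        byte[] bytes = Encoding.Default.GetBytes(text);

        // Decode the bytes back into a UTF-16 string.
        string roundTripped = Encoding.Default.GetString(bytes);

        // True as long as every character fits into the target code page.
        Console.WriteLine(roundTripped == text);
    }
}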
I'm not sure if I really understand your question.
In .NET, when you have a string object then you don't need to care about different encodings. All .NET strings use the same encoding: Unicode (or more precisely: UTF-16).
Different text encodings only come into play when you turn a string object into a byte sequence (e.g. to write it to a text file) or vice versa. I assume this is what you are talking about. To convert a byte sequence from one encoding to another, you could write:
byte[] input = ReadInput(); // e.g. from a file
Encoding decoder = Encoding.GetEncoding("encoding of input");
string str = decoder.GetString(input);
Encoding encoder = Encoding.GetEncoding("encoding of output");
byte[] output = encoder.GetBytes(str);
Of course you need to replace "encoding of input" and "encoding of output" with proper encoding names. MSDN has a list of all supported encodings.
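Putting that together for the two code pages from your debugger dump, a hypothetical KOI8-R-to-windows-1251 file conversion could look like this (the file names are made up):

using System;
using System.IO;
using System.Text;

class ConvertFile
{
    static void Main()
    {
        // On .NET Core / .NET 5+ you would first need to register the
        // legacy code pages (from the System.Text.Encoding.CodePages package):
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] input = File.ReadAllBytes("input-koi8r.txt");

        Encoding decoder = Encoding.GetEncoding("koi8-r");
        string str = decoder.GetString(input);

        Encoding encoder = Encoding.GetEncoding("windows-1251");
        byte[] output = encoder.GetBytes(str);

        File.WriteAllBytes("output-cp1251.txt", output);
    }
}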
You need to know the encoding of the input, either by convention or based on metadata or something. You cannot reliably determine/guess an unknown encoding, but there are some tricks and heuristics you could apply. See How can I detect the encoding/codepage of a text file.
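One heuristic .NET gives you out of the box is byte-order-mark detection; a sketch (the file name is hypothetical, and this only works for Unicode encodings that actually write a BOM):

using System;
using System.IO;
using System.Text;

class DetectBom
{
    static void Main()
    {
        // StreamReader can only recognize encodings by their byte order mark
        // (UTF-8, UTF-16, UTF-32). Legacy code pages such as windows-1251 or
        // koi8-r carry no marker and cannot be detected this way.
        using (var reader = new StreamReader("input.txt",
            Encoding.Default, detectEncodingFromByteOrderMarks: true))
        {
            reader.Peek(); // force the reader to inspect the stream
            Console.WriteLine(reader.CurrentEncoding.WebName);
        }
    }
}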
Edit:
"U+xxxx" is how you usually refer to a specific Unicode code point (the number assigned to a Unicode character), e.g. the code point of the letter "A" (Latin capital A) is U+0041.
Is your input string actually "\\U+1043..." (backslash, backslash, capital U, etc.), or is it only displayed like this, e.g. in a debugger window? If it's the former, then somebody made a mistake while encoding the text, maybe by trying to write a Unicode literal and accidentally escaping the backslash by writing a second one (Edit2: or the characters were deliberately saved in an escaped form to write them into an ASCII-encoded file/stream/etc.). As far as I know, the .NET encoding classes do not help you here; you need to parse the string by hand.
By the way, the numbers in your example are strange. In the standard notation, the number after "U+" is a hex number, not a decimal number. But if you read the code points as hex numbers then they refer to characters from completely unrelated script systems (Burmese, Georgian Mkhedruli, Hangul Jamo); read as decimal numbers they all refer to Cyrillic letters, though.
Edit3: To parse it, look for substrings of the form \\U+xxxx (with each x being a digit), convert xxxx to an int n, create a character with that code point (Char.ConvertFromUtf32(n)), and replace the whole substring with that character.
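A minimal sketch of that parsing with a regular expression (assuming decimal code points, as in your example, and one literal backslash before each "U+"; double the \\ in the pattern if the string really contains two):

using System;
using System.Text.RegularExpressions;

class ParseEscapes
{
    static void Main()
    {
        // The question's input, assuming one literal backslash per escape.
        string input = @"\U+1043\U+1072\U+1073\U+1072\U+1088\U+1080\U+1090" +
                       @"\U+1085\U+1086\U+1089\U+1090\U+1100";

        // Replace every \U+nnnn with the character for that code point,
        // reading nnnn as a decimal number.
        string decoded = Regex.Replace(
            input,
            @"\\U\+(\d+)",
            m => char.ConvertFromUtf32(int.Parse(m.Groups[1].Value)));

        Console.WriteLine(decoded); // prints the Cyrillic word "Габаритность"
    }
}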