Determine text encoding and convert to the default

Tags: c#, .net, f#

I have an input string in some alien encoding: "\\U+1043\\U+1072\\U+1073\\U+1072\\U+1088\\U+1080\\U+1090\\U+1085\\U+1086\\U+1089\\U+1090\\U+1100"

And I want to convert it to my default encoding (System.Text.Encoding.Default):

System.Text.Encoding.Default    {System.Text.SBCSCodePageEncoding}
    BodyName            "koi8-r"
    CodePage            1251
    DecoderFallback     {System.Text.InternalDecoderBestFitFallback}
    EncoderFallback     {System.Text.InternalEncoderBestFitFallback}
    EncodingName        "Cyrillic (Windows)"
    HeaderName          "windows-1251"
    IsBrowserDisplay    true
    IsBrowserSave       true
    IsMailNewsDisplay   true
    IsMailNewsSave      true
    IsReadOnly          true
    IsSingleByte        true
    WebName             "windows-1251"
    WindowsCodePage     1251

How can I determine its encoding, and how can I convert it?

asked Nov 29 '12 by RomanKovalev



1 Answer

I'm not sure if I really understand your question.

In .NET, once you have a string object, you don't need to care about different encodings. All .NET strings use the same encoding: Unicode (or more precisely: UTF-16).

Different text encodings only come into play when you turn a string object into a byte sequence (e.g. to write it to a text file) or vice versa. I assume this is what you mean. To convert a byte sequence from one encoding to another, you could write:

using System.Text;

byte[] input = ReadInput();                                    // e.g. raw bytes from a file
Encoding decoder = Encoding.GetEncoding("encoding of input");  // placeholder name
string str = decoder.GetString(input);                         // bytes -> UTF-16 string
Encoding encoder = Encoding.GetEncoding("encoding of output"); // placeholder name
byte[] output = encoder.GetBytes(str);                         // string -> bytes

Of course you need to replace "encoding of input" and "encoding of output" with proper encoding names; MSDN has a list of all supported encodings.
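
For instance, a minimal sketch assuming the input bytes are koi8-r and the target is windows-1251 (the two code pages that appear in the debugger dump above); the sample bytes are mine, chosen for illustration:

using System;
using System.Text;

class Transcode
{
    static void Main()
    {
        // On .NET Core / .NET 5+, legacy code pages require registering a
        // provider first (NuGet package System.Text.Encoding.CodePages);
        // on .NET Framework this line is unnecessary:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] input = { 0xF0, 0xD2, 0xC9, 0xD7, 0xC5, 0xD4 }; // "Привет" in koi8-r

        string str = Encoding.GetEncoding("koi8-r").GetString(input);       // decode
        byte[] output = Encoding.GetEncoding("windows-1251").GetBytes(str); // re-encode

        Console.WriteLine(str); // Привет
    }
}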

You need to know the encoding of the input, either by convention or based on metadata or something. You cannot reliably determine/guess an unknown encoding, but there are some tricks and heuristics you could apply. See How can I detect the encoding/codepage of a text file.
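
One trick .NET gives you for free is byte order mark (BOM) detection: StreamReader can recognize a BOM at the start of a stream. Note this only identifies Unicode encodings that actually write a BOM; it cannot distinguish legacy code pages such as koi8-r from windows-1251. A minimal sketch ("input.txt" is a placeholder path):

using System;
using System.IO;
using System.Text;

class BomSniffer
{
    static void Main()
    {
        // The fallback encoding is used only if no BOM is found.
        using (var reader = new StreamReader("input.txt",
            Encoding.Default, detectEncodingFromByteOrderMarks: true))
        {
            reader.Peek(); // force the reader to examine the BOM
            Console.WriteLine(reader.CurrentEncoding.WebName);
        }
    }
}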

Edit:

"U+xxxx" is how you usually refer to a specific Unicode code point (the number assigned to a Unicode character), e.g. the code point of the letter "A" (Latin capital A) is U+0041.

Is your input string actually "\\U+1043..." (backslash, backslash, capital U etc.) or is it only displayed like this, e.g. in a debugger window? If it's the former, then somebody made a mistake while encoding the text, maybe by trying to write a Unicode literal and accidentally escaping the backslash by writing a second one. (Edit2: Or the characters were deliberately saved in an escaped way to write them into an ASCII-encoded file/stream/etc.) As far as I know, the .NET encoding classes do not help you here; you need to parse the string by hand.

By the way, the numbers in your example are strange. In standard notation, the number after "U+" is a hexadecimal number, not a decimal one. Read as hex numbers, your code points refer to characters from completely unrelated scripts (Burmese, Georgian Mkhedruli, Hangul Jamo); read as decimal numbers, however, they all refer to Cyrillic letters.

Edit3: To parse it, look for substrings of the form \\U+xxxx (with each x being a digit), convert xxxx to an int n, create the character with that code point (Char.ConvertFromUtf32(n)), and replace the whole substring with that character, as in the sketch below.
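
A minimal sketch of that approach, assuming (per the observation above) that the digits are decimal code points; the class and method names are mine:

using System;
using System.Text;
using System.Text.RegularExpressions;

class UPlusDecoder
{
    // Replaces every \U+nnnn escape with the character it denotes.
    // Assumptions taken from the question (this is not a standard format):
    // the digits are decimal code points, and each escape starts with one
    // or more backslashes (tolerating the doubled-backslash case).
    static string Decode(string input)
    {
        return Regex.Replace(input, @"\\+U\+(\d+)",
            m => Char.ConvertFromUtf32(int.Parse(m.Groups[1].Value)));
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8; // so Cyrillic prints correctly

        string input = @"\U+1043\U+1072\U+1073\U+1072\U+1088\U+1080" +
                       @"\U+1090\U+1085\U+1086\U+1089\U+1090\U+1100";

        Console.WriteLine(Decode(input)); // prints "Габаритность"
    }
}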

answered Oct 18 '22 by Sebastian Negraszus