Is there any way to determine a byte array's encoding in C#?
I have a string, such as "Lorem ipsum áéíóú ñÑç", and I convert it to byte arrays using several different encodings.
I would like a single method that detects the encoding of a byte array so that I can get the original string value back.
A related issue: I may have a database column that stores a BLOB (a byte array) holding a string that was previously converted to bytes using UTF-8. Another application might convert its strings to byte arrays using the Unicode (UTF-16) encoding instead.
So the same database column can contain byte arrays in several encodings. It would be very useful to detect the encoding of each byte array. I need a way to find the encoding of a byte array.
Tests:
string DataXmlForSupport = "<support><machinename></machinename><comments>Este es el log 1 áéíóú</comments></support>";
string DataXmlForSupport2 = "Lorem ipsum áéíóú ñÑç";
[TestMethod]
public void Encoding_byte_array_string()
{
    // Encode as UTF-16 (Unicode), then decode with the right and the wrong encoding.
    var uencoding = new System.Text.UnicodeEncoding();
    byte[] data = uencoding.GetBytes(DataXmlForSupport);
    var dataXml = Encoding.Unicode.GetString(data);
    Assert.AreEqual(DataXmlForSupport, dataXml, "Expected Unicode results");
    dataXml = Encoding.UTF8.GetString(data);
    Assert.AreNotEqual(DataXmlForSupport, dataXml, "Did NOT expect UTF8 results");

    // Encode as UTF-8, then decode with the right and the wrong encoding.
    var utf8 = new System.Text.UTF8Encoding();
    data = utf8.GetBytes(DataXmlForSupport2);
    dataXml = Encoding.UTF8.GetString(data);
    Assert.AreEqual(DataXmlForSupport2, dataXml, "Expected UTF8 results");
    dataXml = Encoding.Unicode.GetString(data);
    Assert.AreNotEqual(DataXmlForSupport2, dataXml, "Did NOT expect Unicode results");
}
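The test above decodes bytes with the wrong encoding and simply gets a different string back, with no error. As a rough sketch (not a real detector), a strict UTF-8 decoder can at least rule UTF-8 out for some inputs. The class and method names below (Utf8Check, IsValidUtf8) are made up for illustration; UTF8Encoding with throwOnInvalidBytes is the standard .NET API.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Strict decoder: throws on invalid byte sequences instead of
    // silently substituting the replacement character U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true when the bytes form well-formed UTF-8.
    // A true result does NOT prove the data was written as UTF-8.
    public static bool IsValidUtf8(byte[] data)
    {
        try
        {
            StrictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Note the limitation: the UTF-16 bytes of a plain ASCII string such as "abc" are also well-formed UTF-8 (each NUL byte is a valid one-byte sequence), so this check can only exclude UTF-8, never confirm it.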
In PHP, mb_detect_encoding() is used to detect the character encoding. It picks the character encoding of a string from an ordered list of candidates. This function is available in PHP 4.0.6 and later.
A byte array is simply a collection of bytes. In Python, the bytearray() function returns a bytearray object, which is an array of the specified bytes. The bytearray class is a mutable sequence of integers ranging from 0 to 255.
Text in String instances is stored as UTF-16. You can include specific Unicode characters in a String using an escape such as \u03a0 (here, the capital pi character, for example).
In short, no. Please see How to detect the character encoding of a text file? for a detailed answer on various encodings and why they can't be automatically determined.
Your best solution is to convert the string from its original encoding to UTF-8 and convert that to a byte array. Then you'll know your byte array's encoding...
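A minimal sketch of that advice, assuming every writer agrees on UTF-8 for the stored bytes. The class and method names (KnownEncodingExample, ToUtf8Bytes, FromUtf8Bytes, ConvertToUtf8) are hypothetical; Encoding.Convert is the standard .NET call for re-encoding bytes whose source encoding is already known.

```csharp
using System;
using System.Text;

public static class KnownEncodingExample
{
    // Encode text with the one agreed-upon encoding (UTF-8 here) before storing it.
    public static byte[] ToUtf8Bytes(string text) => Encoding.UTF8.GetBytes(text);

    // Decode bytes that are known to have been stored as UTF-8.
    public static string FromUtf8Bytes(byte[] data) => Encoding.UTF8.GetString(data);

    // Re-encode bytes from a KNOWN source encoding into UTF-8,
    // for example when another application hands over UTF-16 data.
    public static byte[] ConvertToUtf8(byte[] data, Encoding source) =>
        Encoding.Convert(source, Encoding.UTF8, data);
}
```

With this approach the reader never has to guess: anything read from the column is decoded with FromUtf8Bytes, and data arriving in another known encoding goes through ConvertToUtf8 first.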
I realize I'm late to the party here, but I just had a need to do this very thing and found a good way to do it:
using System.IO;
using System.Text;

byte[] data; // Populate this however you see fit with your data
string text;
Encoding enc;
using (StreamReader reader = new StreamReader(new MemoryStream(data),
    detectEncodingFromByteOrderMarks: true))
{
    text = reader.ReadToEnd();
    // Valid only after a read: the encoding the reader detected from the
    // byte order mark (it stays at the UTF-8 default when there is no BOM).
    enc = reader.CurrentEncoding;
}
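For reference, here is a rough sketch of what detection from byte order marks amounts to, written by hand. The names BomSniffer, DetectFromBom and StartsWith are made up for illustration, and this is not how StreamReader is implemented internally. Keep in mind that Encoding.UTF8.GetBytes and Encoding.Unicode.GetBytes do not write a BOM, so byte arrays produced that way (as in the question's tests) give no result here.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a leading byte order mark,
    // or null when the data does not start with a recognized BOM.
    public static Encoding DetectFromBom(byte[] data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        // UTF-32 LE (FF FE 00 00) must be tested before UTF-16 LE (FF FE).
        // Assumption: UTF-16 LE text whose first character is U+0000 is
        // not expected, since it would be misread as UTF-32 here.
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return new UTF32Encoding(false, true);
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(true, true);
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return Encoding.UTF8;
        if (StartsWith(data, 0xFF, 0xFE)) return Encoding.Unicode;
        if (StartsWith(data, 0xFE, 0xFF)) return Encoding.BigEndianUnicode;
        return null;
    }

    // True when data begins with exactly the given bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A null result means "no BOM", not "not text", so the caller still has to fall back to an encoding it knows or assumes.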