I am trying to encode a String with the UTF-16 encoding scheme and perform MD5 hashing on it. Strangely, Java and C# return different results when I do this.
The following is the piece of code in Java:
import java.security.MessageDigest;

public static void main(String[] args) {
    String str = "preparar mantecado con coca cola";
    try {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        digest.update(str.getBytes("UTF-16"));
        byte[] hash = digest.digest();
        // hex-encode each byte of the digest
        String output = "";
        for (byte b : hash) {
            output += Integer.toString((b & 0xff) + 0x100, 16).substring(1);
        }
        System.out.println(output);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The output for this is: 249ece65145dca34ed310445758e5504
The following is the piece of code in C#:
public static string GetMD5Hash()
{
    string input = "preparar mantecado con coca cola";
    System.Security.Cryptography.MD5CryptoServiceProvider x = new System.Security.Cryptography.MD5CryptoServiceProvider();
    byte[] bs = System.Text.Encoding.Unicode.GetBytes(input);
    bs = x.ComputeHash(bs);
    // hex-encode each byte of the digest
    System.Text.StringBuilder s = new System.Text.StringBuilder();
    foreach (byte b in bs)
    {
        s.Append(b.ToString("x2"));
    }
    string output = s.ToString();
    Console.WriteLine(output);
    return output;
}
The output for this is: c04d0f518ba2555977fa1ed7f93ae2b3
I am not sure why the outputs are not the same. How do we change the above code so that both of them produce the same output?
The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units.
UTF-16 supports all languages covered by Unicode, though some characters are encoded not with 16 bits but with 32 bits (two 16-bit code units). And strictly speaking, Unicode does not support languages but scripts.
Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.
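Because Unicode has since outgrown 16 bits, a single char can no longer hold every character; supplementary code points are stored as a surrogate pair of two chars. A minimal sketch illustrating this (the class name and example code point are mine, not from the question):

public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.codePointCount(0, clef.length()));     // 1 code point
        System.out.println(clef.length());                             // 2 chars (UTF-16 code units)
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
    }
}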
UTF-16 != UTF-16.
In Java, getBytes("UTF-16")
returns an a big-endian representation with optional byte-ordering mark. C#'s System.Text.Encoding.Unicode.GetBytes
returns a little-endian representation. I can't check your code from here, but I think you'll need to specify the conversion precisely.
Try getBytes("UTF-16LE")
in the Java version.
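To make the difference concrete, here is a small sketch (my own, not part of the original answer) that prints the raw bytes produced by the three charset names; "UTF-16" prepends the BOM fe ff and uses big-endian order, while "UTF-16LE" matches what C#'s Encoding.Unicode produces:

import java.nio.charset.StandardCharsets;

public class Utf16BytesDemo {
    public static void main(String[] args) throws Exception {
        String s = "ab";
        printHex("UTF-16  ", s.getBytes("UTF-16"));                   // fe ff 00 61 00 62
        printHex("UTF-16BE", s.getBytes(StandardCharsets.UTF_16BE));  // 00 61 00 62
        printHex("UTF-16LE", s.getBytes(StandardCharsets.UTF_16LE));  // 61 00 62 00
    }

    private static void printHex(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ": ");
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        System.out.println(sb.toString().trim());
    }
}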
The first thing I can find, and this might not be the only problem, is that C#'s Encoding.Unicode.GetBytes() is little-endian, while Java's "UTF-16" charset defaults to big-endian.
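Putting it together, here is a sketch of the corrected Java side (my own, assuming the goal is to match the C# output above), hashing the UTF-16LE bytes, which is the same byte sequence Encoding.Unicode produces:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5Utf16Le {
    public static void main(String[] args) throws Exception {
        String str = "preparar mantecado con coca cola";
        MessageDigest digest = MessageDigest.getInstance("MD5");
        // UTF-16LE: little-endian, no BOM, same bytes as C#'s Encoding.Unicode.GetBytes
        byte[] hash = digest.digest(str.getBytes(StandardCharsets.UTF_16LE));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        // expected to match the C# output above: c04d0f518ba2555977fa1ed7f93ae2b3
        System.out.println(hex);
    }
}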