I met a problem writing Chinese characters into an Oracle database. Here's some information for your reference.
Environment: Oracle 8
select userenv('language') form dual;
returns
American.America.UTF8
Developing:
.NET2/C#
Client characterset:
gb2312
I simply test write value into the table manually, the result is correct and should be as the following:
VALUE: 朋友 //chinese word means 'friend'
DUMP: 197,243,211,209 //caculated by Oracle dump() function
Using code to write, code snap as following:
Encoding def = Encoding.Default;
Encoding utf8 = Encoding.UTF8;
byte[] bytes = def.GetBytes("朋友");
//For debug
//string debug = "";
//foreach(byte b in bytes)
// debug += b.ToString() + " ";
//Debug.WriteLine(debug); //That will display 197,243,211,209 as the same as the Dump value mentioned
string value = utf8.GetString(bytes);
//I also try, string value = utf8.GetString(Encoding.Convert(def,utf8,bytes))
string sql = String.Format("UPDATE T SET C='{0}' WHERE...",value);
//execute the sql...
After that, the value stored in the DBMS is not right, both value and it's dump result.
Decimal Hexadecimal Binary
197 c5 1100 0101
243 f3 1111 0011
211 d3 1101 0011
209 d1 1101 0001
Looking at Wikipedia We see that 197 is the first byte of a two byte sequence, 243 is the first byte of a four byte sequence, 211 is the first byte of a two byte sequence, 208 is the first byte of a two byte sequence. That is not valid UTF-8. Could you tell us what the unicode code points for the two characters 朋友 are?
EDIT Ah, GB2313, c5f3 is Unicode code point u+670b. And d3d1 is u+53cb. (Found using converter at http://demo.icu-project.org/icu-bin/convexp?conv=ibm-1383_P110-1999&ShowLocales&s=ALL#ShowLocales)
Double check the client character set that the Oracle client is using. What I have seen (on Oracle 10gR2) is that if the Oracle's client has the same character encoding as the database server, then the characters will not be translated (because they are the same character set) but they will not be validated. It appears that they were the same at the time of the manual insert and the GB2313 values were inserted for the characters you wanted, which is invalid inside the DB, since it is utf8.
Note, Oracle's "utf8" character set is not full modern UTF-8, but instead CESU-8. Not an issue in this case, since these characters sit on the Basic Multilingual Plane, and have the same encoding in UTF-8 and CESU-8. The earliest reference I could find was for Oracle 8i: http://download.oracle.com/docs/cd/A87860_01/doc/server.817/a76966/appa.htm#971460.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With