Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

write chars into DB with correct character-set

Tags:

c#

oracle

I met a problem writing Chinese characters into an Oracle database. Here's some information for your reference.

  1. Environment: Oracle 8

    select userenv('language') form dual;
    

    returns

    American.America.UTF8
    
  2. Developing:

    .NET2/C#
    
  3. Client characterset:

    gb2312
    

I simply test write value into the table manually, the result is correct and should be as the following:

VALUE: 朋友        //chinese word means 'friend'
DUMP: 197,243,211,209 //caculated by Oracle dump() function

Using code to write, code snap as following:

Encoding def = Encoding.Default;
Encoding utf8 = Encoding.UTF8;
byte[] bytes = def.GetBytes("朋友");
//For debug
//string debug = "";
//foreach(byte b in bytes)
//   debug += b.ToString() + " ";
//Debug.WriteLine(debug); //That will display 197,243,211,209 as the same as the Dump value mentioned
string value = utf8.GetString(bytes);
//I also try, string value = utf8.GetString(Encoding.Convert(def,utf8,bytes))
string sql = String.Format("UPDATE T SET C='{0}' WHERE...",value);
//execute the sql...

After that, the value stored in the DBMS is not right, both value and it's dump result.

like image 242
user622851 Avatar asked Nov 13 '22 16:11

user622851


1 Answers

Decimal Hexadecimal Binary
197     c5          1100 0101
243     f3          1111 0011
211     d3          1101 0011
209     d1          1101 0001

Looking at Wikipedia We see that 197 is the first byte of a two byte sequence, 243 is the first byte of a four byte sequence, 211 is the first byte of a two byte sequence, 208 is the first byte of a two byte sequence. That is not valid UTF-8. Could you tell us what the unicode code points for the two characters 朋友 are?

EDIT Ah, GB2313, c5f3 is Unicode code point u+670b. And d3d1 is u+53cb. (Found using converter at http://demo.icu-project.org/icu-bin/convexp?conv=ibm-1383_P110-1999&ShowLocales&s=ALL#ShowLocales)

Double check the client character set that the Oracle client is using. What I have seen (on Oracle 10gR2) is that if the Oracle's client has the same character encoding as the database server, then the characters will not be translated (because they are the same character set) but they will not be validated. It appears that they were the same at the time of the manual insert and the GB2313 values were inserted for the characters you wanted, which is invalid inside the DB, since it is utf8.

Note, Oracle's "utf8" character set is not full modern UTF-8, but instead CESU-8. Not an issue in this case, since these characters sit on the Basic Multilingual Plane, and have the same encoding in UTF-8 and CESU-8. The earliest reference I could find was for Oracle 8i: http://download.oracle.com/docs/cd/A87860_01/doc/server.817/a76966/appa.htm#971460.

like image 58
Shannon Severance Avatar answered Dec 21 '22 09:12

Shannon Severance