
How to define/declare utf-8 code points for Turkish special chars (non-ascii) to use them as standard utf-8 encoding?

Tags: encoding, utf-8

Turkish characters 'ÇçĞğİıÖöŞşÜü' are not handled correctly in UTF-8 encoding, although they all seem to be defined. In use, the char code of each of them is 65533 (the replacement character, possibly for error display), and a question mark or a box is displayed depending on the selected font. In some cases 0/null is returned as the char code. On the internet there are lots of tools that give UTF-8 definitions for them, but I am not sure whether these tools use a defined (real/international) registry or create the definition dynamically with known rules and calculations. Fonts for them are well defined, and there is no problem displaying them when we enter the code points manually. This proves that they are defined in UTF-8. On the other hand, they are not handled in encodings or transformations such as ajax requests/responses.

So the base question is "HOW CAN WE DEFINE A CODE POINT FOR A CHAR?" The question may be restated as follows to prevent misconception. Suppose we have prepared the encoding data for "Ç" like this:

Character : Ç
Character name : LATIN CAPITAL LETTER C WITH CEDILLA
Hex code point : 00C7
Decimal code point : 199
Hex UTF-8 bytes : C387
......

Where/how can we save this info so that it becomes a standard UTF-8 char?
How can we distribute/expose it (make it ready to be used by others)?
Do we need confirmation by anybody/any foundation (like the Unicode/UTF-8 consortium)?
How can we detect/fix errors if the chars are already registered but not working correctly?
Can we have a custom UTF-8 configuration? If yes, how?

Note: No code snippet is needed here, as this is not a misuse problem.

İlhan ÇELİK asked Feb 04 '13 03:02


1 Answer

The characters you mention are present in Unicode. Here are their code points in hexadecimal and how they are encoded in UTF-8:

      Ç     ç     Ğ     ğ     İ     ı     Ö     ö     Ş     ş     Ü     ü
Code: 00c7  00e7  011e  011f  0130  0131  00d6  00f6  015e  015f  00dc  00fc
UTF8: c3 87 c3 a7 c4 9e c4 9f c4 b0 c4 b1 c3 96 c3 b6 c5 9e c5 9f c3 9c c3 bc

This means that if you write for example the bytes 0xc4 0x9e into a file you have written the character Ğ, and any software tool that understands UTF-8 must read it back as Ğ.
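As a minimal sketch of that round trip in Java (the class and method names here are made up for illustration), the snippet below encodes Ğ (U+011E) to UTF-8, decodes the bytes c4 9e back, and then decodes the single byte 0xd0 as UTF-8. The last case is only an assumption about what might be going wrong in the question: 0xd0 is how Ğ is stored in ISO-8859-9, and reading it as UTF-8 gives the replacement character 65533 that the question describes.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Encode a string as UTF-8 and format the bytes as hex pairs, e.g. "c4 9e".
    static String encodeHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode the given byte values as UTF-8; malformed input becomes U+FFFD.
    static String decode(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+011E is LATIN CAPITAL LETTER G WITH BREVE.
        System.out.println(encodeHex("\u011e"));                 // c4 9e
        System.out.println((int) decode(0xc4, 0x9e).charAt(0));  // 286 (0x011e)
        // A lone 0xd0 is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println((int) decode(0xd0).charAt(0));        // 65533
    }
}
```

If the char codes you see are 65533, it is worth checking whether every step (page, request, server, database) really reads and writes the bytes as UTF-8.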

Update: For correct alphabetic order and case conversions in Turkish you have to use a library that understands locales, just like for any other natural language. For example in Java:

import java.util.Locale;

Locale tr = new Locale("tr", "TR");                   // Turkish locale: language "tr", country "TR"
System.out.println("ÇçĞğİıÖöŞşÜü".toUpperCase(tr));   // ÇÇĞĞİIÖÖŞŞÜÜ
System.out.println("ÇçĞğİıÖöŞşÜü".toLowerCase(tr));   // ççğğiıööşşüü

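Notice how İ in lowercase becomes i, and ı in uppercase becomes I. By the same rule, in the Turkish locale i in uppercase becomes İ, and I in lowercase becomes ı. You don't say which programming language you use, but its standard library most likely supports locales, too.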
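To see why the locale argument matters, here is a small sketch that converts the same ASCII word with the locale-neutral `Locale.ROOT` and with the Turkish locale. The class name `TurkishCase` and its helpers are hypothetical. The string literals use `\u` escapes so that the result does not depend on the source file encoding.

```java
import java.util.Locale;

public class TurkishCase {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // Locale-neutral rules: i <-> I.
        System.out.println(upper("title", Locale.ROOT));  // TITLE
        System.out.println(lower("TITLE", Locale.ROOT));  // title
        // Turkish rules: i -> U+0130 (dotted capital I), I -> U+0131 (dotless small i).
        System.out.println(upper("title", TURKISH).equals("T\u0130TLE"));  // true
        System.out.println(lower("TITLE", TURKISH).equals("t\u0131tle"));  // true
    }
}
```

The practical consequence is that calling `toUpperCase()` without an explicit locale uses the default locale of the machine, so the same program can behave differently on a Turkish system.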

Unicode defines the code points and certain properties for each character (for example, whether it is a digit or a letter and, for a letter, whether it is uppercase, lowercase, or titlecase), and certain generic algorithms for dealing with Unicode text (e.g. how to mix right-to-left text and left-to-right text). Alphabetic order and correct case conversion are defined by national standardization bodies, like the Institute for the Languages of Finland in Finland or the Real Academia Española in Spain, independent of Unicode.
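These per-character properties can be queried directly, with nothing to register yourself. A minimal sketch using the `java.lang.Character` methods (the class name `CharProperties` and the `describe` helper are made up for illustration):

```java
public class CharProperties {
    // Summarize a few Unicode properties of one code point.
    static String describe(int cp) {
        String kind = Character.isLetter(cp) ? "letter"
                : Character.isDigit(cp) ? "digit"
                : "other";
        String letterCase = Character.isUpperCase(cp) ? "uppercase"
                : Character.isLowerCase(cp) ? "lowercase"
                : Character.isTitleCase(cp) ? "titlecase"
                : "uncased";
        return String.format("U+%04X %s: %s, %s",
                cp, Character.getName(cp), kind, letterCase);
    }

    public static void main(String[] args) {
        // U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA: letter, uppercase
        System.out.println(describe(0x00C7));
        // U+011F LATIN SMALL LETTER G WITH BREVE: letter, lowercase
        System.out.println(describe(0x011F));
        // U+0037 DIGIT SEVEN: digit, uncased
        System.out.println(describe(0x0037));
    }
}
```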

Update 2:

The test ((ch&0x20)==ch) for lower case is broken for most languages in the world, not just Turkish. So is the algorithm for converting upper case to lower case you mention. Also, the test for being a letter is incorrect: in many languages Z is not the last letter of the alphabet. To work with text correctly you must use library functions that have been written by people who know what they are doing.
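To illustrate why bit tricks of this kind fail, the sketch below compares a hypothetical ASCII-style shortcut (checking whether bit 0x20 is set, which does separate a-z from A-Z in ASCII) with the library call `Character.isLowerCase`. It is not the exact test from your comment, only an example of the same family of shortcuts.

```java
public class LowerCaseCheck {
    // Hypothetical ASCII-only shortcut: a-z have bit 0x20 set, A-Z do not.
    static boolean asciiBitSaysLower(char ch) {
        return (ch & 0x20) != 0;
    }

    // Library check based on the Unicode character properties.
    static boolean libraryIsLower(char ch) {
        return Character.isLowerCase(ch);
    }

    public static void main(String[] args) {
        // a, A, U+011F (small g breve), U+011E (capital G breve),
        // U+0131 (dotless small i), U+0130 (dotted capital I)
        char[] samples = {'a', 'A', '\u011f', '\u011e', '\u0131', '\u0130'};
        for (char ch : samples) {
            System.out.printf("U+%04X bit=%b library=%b%n",
                    (int) ch, asciiBitSaysLower(ch), libraryIsLower(ch));
        }
        // The two checks agree on a and A but disagree on U+011F
        // (bit=false, library=true) and U+0130 (bit=true, library=false).
    }
}
```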

Unicode is supposed to be universal. Creating national and language-specific variants of encodings is what led us to the mess that Unicode is trying to solve. Unfortunately there is no universal standard for ordering characters. For example in English a = ä < z, but in Swedish a < z < ä. In German Ü is equivalent to U by one standard, and to UE by another. In Finnish Ü = Y. There is no way to order code points so that the ordering would be correct in every language.
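The English versus Swedish ordering can be reproduced with `java.text.Collator`, which applies language-specific collation rules instead of comparing code points. This is a sketch that assumes the locale data bundled with a standard JDK; the class name `LocaleSort` and the `sorted` helper are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LocaleSort {
    // Return the words sorted by the collation rules of the given language.
    static List<String> sorted(String languageTag, String... words) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        List<String> copy = new ArrayList<>(Arrays.asList(words));
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        // "\u00e4" is a with diaeresis.
        // English groups it with a:        a, a-with-diaeresis, z
        System.out.println(sorted("en", "z", "\u00e4", "a"));
        // Swedish treats it as a letter after z:  a, z, a-with-diaeresis
        System.out.println(sorted("sv", "z", "\u00e4", "a"));
    }
}
```

The same three strings come out in two different orders, which is why no single ordering of code points could satisfy both languages.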

Joni answered Oct 15 '22 20:10