
Which subset of Unicode symbols should I use to mark special substrings in text?

Our application sends strings that are then localized on the client side. Sometimes a whole string needs localizing, sometimes only a substring, so we have to mark the relevant part. Ideally the marking would use only Unicode characters, since that would not require any protocol changes.

Example:

"Length: (mark)10(mark)"

where 10 is a length in cm that should be converted so that it is displayed in inches or mm.
Are the code points in the Unicode Specials block (U+FFF0..U+FFFF) the right choice for marking such substrings in text?

Kubuxu asked Aug 16 '14 19:08




1 Answer

No. Code points in the Specials block have defined uses of their own (for example, U+FFFD is the replacement character), so using them for other purposes may have unexpected effects. Even if you write all the processing code yourself, the incoming data might already contain those code points. You could detect them and filter them out, but it is better to use code points that cannot clash with any assigned character.

Use code points in the range U+FDD0..U+FDEF. They are designated as “noncharacters” and are intended for internal use within an application. See the Unicode FAQ on Private-Use Characters, Noncharacters & Sentinels.
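As a rough illustration of this approach, here is a minimal Python sketch. It assumes that both ends belong to the same application and that the transport passes these code points through unchanged. The choice of U+FDD0 and U+FDD1 as opening and closing markers, and the names `mark`, `sanitize`, `localize` and `cm_to_inches`, are hypothetical and not part of the question.

```python
import re

# Hypothetical choice: two noncharacters from U+FDD0..U+FDEF as open/close markers.
MARK_OPEN = "\uFDD0"
MARK_CLOSE = "\uFDD1"

# Matches any noncharacter in the range U+FDD0..U+FDEF.
_NONCHAR_RE = re.compile("[\uFDD0-\uFDEF]")
# Matches one marked span; group 1 is the text between the markers.
_MARKED_RE = re.compile(MARK_OPEN + "(.*?)" + MARK_CLOSE, re.DOTALL)


def sanitize(text: str) -> str:
    """Drop noncharacters from untrusted input so they cannot be mistaken for markers."""
    return _NONCHAR_RE.sub("", text)


def mark(value: str) -> str:
    """Wrap a value that the client should localize."""
    return MARK_OPEN + sanitize(value) + MARK_CLOSE


def localize(text: str, convert) -> str:
    """Replace each marked span with convert(span), removing the markers."""
    return _MARKED_RE.sub(lambda m: convert(m.group(1)), text)


def cm_to_inches(cm: str) -> str:
    """Example converter (an assumption): centimetres to inches, two decimals."""
    return f"{float(cm) / 2.54:.2f} in"


message = "Length: " + mark("10")
print(localize(message, cm_to_inches))  # prints "Length: 3.94 in"
```

Running `sanitize` on any externally supplied text before it is combined with marked values keeps stray noncharacters in the input from being read as markers.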

Jukka K. Korpela answered Sep 21 '22 18:09
