
How to convert a Unicode code point to characters in C++ using ICU?

Tags:

c++

unicode

icu

I couldn't find the answer on Google, probably because I'm searching with the wrong terminology. I'm trying to perform a simple task: convert a number that represents a character into the character itself, as in this table: http://unicode-table.com/en/#0460

For example, if my number is 47 (which is '/'), I can just put 47 in a char and print it using cout, and I will see a slash in the console (there is no problem for numbers lower than 128, which fit in a single UTF-8 byte).

But if my number is 1120, the character should be 'Ѡ' (the Cyrillic capital letter omega, U+0460). I assume it is represented by several chars (which the console will render as 'Ѡ' when cout prints them).

How do I get these "several characters" that represent 'Ѡ'?

I have a library called ICU, and I'm using UTF-8.

OopsUser asked Apr 27 '14 at 10:04


2 Answers

What you call a number that represents a character is usually called a code point. If you want to work with C++ and Unicode strings, ICU offers an icu::UnicodeString class, which is documented in the ICU API reference.

To create a UnicodeString holding a single character, you can use the constructor that takes a code point in a UChar32:

icu::UnicodeString::UnicodeString(UChar32 ch)

Then you can call the toUTF8String method to convert the string to UTF-8.

Example program:

#include <iostream>
#include <string>

#include <unicode/unistr.h>

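int main() {
    // Build a UnicodeString from a single code point (U+0460, decimal 1120).
    icu::UnicodeString uni_str(static_cast<UChar32>(1120));
    std::string str;
    // Append the UTF-8 encoding of the string to str.
    uni_str.toUTF8String(str);
    std::cout << str << std::endl;

    return 0;
}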

On a Linux system like Debian, you can compile this program with:

g++ so.cc -o so -licuuc

If your terminal supports UTF-8, this will print the Cyrillic omega character 'Ѡ'.
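If you want to see the "several characters" themselves, here is a minimal sketch that does the UTF-8 encoding by hand using only the standard library. This is not how ICU implements it, and encode_utf8 is a hypothetical helper name I made up for illustration, not an ICU function. It assumes the input is a single Unicode scalar value and rejects surrogates and anything above U+10FFFF.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ICU): encode one code point as UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp) {
    // Surrogates and values above U+10FFFF cannot be encoded in UTF-8.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a valid Unicode scalar value");
    }

    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, encode_utf8(1120) returns the two bytes 0xD1 0xA0, which a UTF-8 terminal renders as 'Ѡ', while encode_utf8(47) returns the single byte '/'. For real code, prefer the ICU route above, since it also covers whole strings and the other encodings.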

nwellnhof answered Oct 06 '22 at 13:10


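Note: if linking fails with an error such as 'undefined reference to icudt67_dat', you also need to link the ICU data library, for example with -licudt (on many Linux distributions it is named -licudata). The number in the symbol name is the ICU major version, so it may differ on your system.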

krak'175 answered Oct 06 '22 at 13:10