
UTF-8 String Iterators

I am trying to write a cross-platform application with Unicode support. I am using the UTF8-C++ library ( http://utfcpp.sourceforge.net/ ), but I am having trouble iterating through a string:

string s1 = "Добрый день";
utf8::iterator<string::iterator> iter(s1.begin(), s1.begin(), s1.end());

for(int i = 0; i < utf8::distance(s1.begin(), s1.end()); i++, ++iter)
{
    cout << (*iter);
}

When the output of the above code is redirected to a UTF-8 encoded text file, the file contains the following:

6 3 6 3 6 3 6 3 6 3 6 3 3 2 6 3 6 3 6 3 6 3 

How can I get the content of s1 to appear in the file properly?

Qman asked Aug 23 '12 16:08



1 Answer

You need to ensure that the string is being initialized with the correct data, and then that the iterator is producing the correct values.

You're using VS2010, so string literals are a problem. C++ implementations have an 'execution character set' to which they convert character and string literals from the 'source character set'. Visual Studio 2010 does not support UTF-8 as an execution character set, and therefore will not intentionally produce a UTF-8 encoded string literal.

You can get one by tricking the compiler, or by using hex escapes. Alternatively, instead of a UTF-8 string literal, you could use a wide string literal containing the correct data and convert it to UTF-8 at runtime.
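The runtime-conversion option is not spelled out in this answer, so here is a minimal sketch of it. It assumes the wide literal is written with universal character names (\uXXXX) so that it does not depend on the source encoding. The helper names append_utf8 and wide_to_utf8 are made up for illustration; they are not part of UTF8-C++ or the standard library. The sketch copes with both 16-bit wchar_t (UTF-16, as on Windows) and 32-bit wchar_t, and it does not validate its input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Illustrative helper: no check for surrogates or values above U+10FFFF.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a wide string to UTF-8 at runtime (hypothetical helper).
std::string wide_to_utf8(const std::wstring& ws) {
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(ws[i]);
        // Where wchar_t is 16 bits, characters outside the BMP arrive as a
        // UTF-16 surrogate pair; combine the pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            const std::uint32_t lo = static_cast<std::uint32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Calling wide_to_utf8(L"\u0414\u043e\u0431\u0440\u044b\u0439 \u0434\u0435\u043d\u044c") should give the same bytes as the hex-escaped literal in the "Hex escapes" section. On Windows, the WideCharToMultiByte API with CP_UTF8 is the usual alternative to hand-rolling this.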


Edit: More recent versions of Visual Studio can produce UTF-8 string literals. Visual Studio 2015 supports C++11's UTF-8 string literals, and from Visual Studio 2015 Update 2 you can also use the compiler flags /execution-charset:utf-8 or /utf-8.


Tricking the compiler

If you save the source code as 'UTF-8 without signature' (that is, without a BOM), the compiler will assume that the source encoding is the system locale encoding. By default, VS uses the system locale encoding as the execution encoding, so when it believes the source and execution encodings are the same it performs no conversion. Your source bytes, which are actually UTF-8, are then used directly for the string literal, producing a UTF-8 encoded string literal. (Note that this breaks the conversion done for wide character and string literals.)

Hex escapes

Hex escape codes let you manually insert code units (bytes in this case) of any value into a string literal. You can manually determine the UTF-8 encoding you want and then insert those values into the string literal.

std::string s1 = "\xd0\x94\xd0\xbe\xd0\xb1\xd1\x80\xd1\x8b\xd0\xb9 \xd0\xb4\xd0\xb5\xd0\xbd\xd1\x8c";

UTF-8 string literal prefix

C++11 specifies a prefix that creates a UTF-8 string literal regardless of the execution encoding. At the time of writing Visual Studio did not implement it; Visual Studio 2015 and later do. It looks like:

string s1 = u8"Добрый день";

It requires that the compiler know and use the correct source encoding (and therefore that the source encoding support the desired string). The compiler then converts from the source encoding to UTF-8 instead of to the execution encoding. With a Visual Studio version that supports this feature, you'll probably want to save your source code as 'UTF-8 with signature'. (Again, VS depends on the signature to identify UTF-8 source.)
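One portability caveat, offered as a hedged sketch: since C++20 a u8 literal has type const char8_t[] rather than const char[], so assigning one directly to a std::string no longer compiles in that mode. The helper below copies the literal's bytes into a std::string under either rule. The names from_u8_literal and greeting_utf8 are hypothetical, and the literal uses \uXXXX escapes so the example does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <string>

// Copy the bytes of a u8 string literal into a std::string.
// CharT is char before C++20 and char8_t from C++20 on; both are one byte.
template <typename CharT, std::size_t N>
std::string from_u8_literal(const CharT (&lit)[N]) {
    static_assert(sizeof(CharT) == 1, "expects a u8 string literal");
    // N counts the terminating null, so copy N - 1 bytes.
    return std::string(reinterpret_cast<const char*>(lit), N - 1);
}

// The example string from the question, as UTF-8 bytes.
std::string greeting_utf8() {
    return from_u8_literal(u8"\u0414\u043e\u0431\u0440\u044b\u0439 \u0434\u0435\u043d\u044c");
}
```

The result holds 21 bytes: two for each of the ten Cyrillic letters plus one for the space.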


Once you have a UTF-8 string, and assuming the UTF-8 iterator works, your example code should produce the correct 11 code points, and I think the output text should look like:

104410861073108810991081321076107710851100

Insert some spaces to make it readable and you can verify that you're getting the right values:

1044 1086 1073 1088 1099 1081 32 1076 1077 1085 1100

Or make it hex and add the Unicode prefix:

U+0414 U+043e U+0431 U+0440 U+044b U+0439 U+0020 U+0434 U+0435 U+043d U+044c
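To check those values without depending on the library, a small hand-written decoder is enough. This is a sketch, not UTF8-C++'s implementation: decode_utf8 and format_code_points are illustrative names, and the decoder assumes well-formed input (it only guards against a truncated final sequence).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into code points.
// Assumption: the input is well-formed; continuation bytes are not validated.
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = lead;  // a single-byte (ASCII) value by default
        std::size_t extra = 0;    // number of continuation bytes to read
        if (lead >= 0xF0)      { cp = lead & 0x07; extra = 3; }
        else if (lead >= 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if (lead >= 0xC0) { cp = lead & 0x1F; extra = 1; }
        if (extra > s.size() - i - 1) break;  // truncated final sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Render code points as "U+0414 U+043e ..." (lowercase hex, as above).
std::string format_code_points(const std::vector<std::uint32_t>& cps) {
    std::string out;
    char buf[16];
    for (std::size_t i = 0; i < cps.size(); ++i) {
        std::snprintf(buf, sizeof buf, "U+%04x", static_cast<unsigned>(cps[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Feeding the hex-escaped s1 from the "Hex escapes" section to decode_utf8 should give 11 values starting with 1044, and format_code_points on that result should reproduce the U+ line above.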

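If you actually want to produce a UTF-8 encoded output file, then you shouldn't be using the UTF-8 iterator anyway: it decodes the string into code points, whereas the file needs the encoded bytes. Just write the string directly: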

string s1 = "Добрый день";
std::cout << s1;

When the output is redirected to a file then the file will contain the UTF-8 encoded data:

Добрый день

I don't understand why your actual output currently contains a bunch of extra spaces, but it looks like the actual numbers that are being accessed are:

63 63 63 63 63 63 32 63 63 63 63

63 is the ASCII code for '?' and 32 is the ASCII code for a space, so the string is effectively "?????? ????". You are clearly suffering from VC++'s conversion of the string literal to the system locale encoding.

bames53 answered Oct 22 '22 20:10