
UTF-8 String Iterators

Tags: c++, iterator, string, unicode, utf-8

I am trying to write a Unicode-supported cross-platform application. I am using the library UTF8-C++ ( http://utfcpp.sourceforge.net/ ) but I am having trouble iterating through a string:

string s1 = "Добрый день";
utf8::iterator<string::iterator> iter(s1.begin(), s1.begin(), s1.end());

for(int i = 0; i < utf8::distance(s1.begin(), s1.end()); i++, ++iter)
{
    cout << (*iter);
}

The above code, when redirected to a UTF-8 formatted text file, produces the following output:

6 3 6 3 6 3 6 3 6 3 6 3 3 2 6 3 6 3 6 3 6 3

How can I get the content of s1 to appear in the file properly?


asked Aug 23 '12 16:08

Qman

1 Answer

You need to ensure that the string is being initialized with the correct data, and then that the iterator is producing the correct values.

You're using VS2010, so there's a bit of a problem with string literals. C++ implementations have an 'execution character set' to which they convert character and string literals from the 'source character set'. Visual Studio 2010 does not support UTF-8 as an execution character set, and therefore will not intentionally produce a UTF-8 encoded string literal.

You can get one by tricking the compiler, or by using hex escapes. Alternatively, instead of a UTF-8 string literal, you could use a wide string literal containing the correct data and convert it to UTF-8 at runtime.
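A minimal sketch of the runtime conversion, assuming the wide literal itself was compiled correctly. The helper names append_utf8 and wide_to_utf8 are made up for illustration and do no validation; since you already use UTF8-C++, its utf8::utf16to8 or utf8::utf32to8 functions are probably the better choice in real code.

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// Hypothetical helper for illustration; the code point is not validated.
void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a wide string to UTF-8. Handles 16-bit wchar_t (UTF-16, as on
// Windows) and 32-bit wchar_t (UTF-32, as on most Unix-like systems).
std::string wide_to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(ws[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            unsigned long lo = static_cast<unsigned long>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Writing the wide literal with universal character names, e.g. L"\u0414\u043e\u0431\u0440\u044b\u0439", keeps it independent of how the source file is saved.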

Edit: More recent versions of Visual Studio can produce UTF-8 string literals. Visual Studio 2015 supports C++11's UTF-8 string literals, and from Visual Studio 2015 Update 2 you can also use the compiler flags /execution-charset:utf-8 or /utf-8.

Tricking the compiler

If you save the source code as 'UTF-8 without signature' (that is, without a byte order mark) then the compiler will think that the source encoding is the system locale encoding. VS always uses the system locale encoding as the execution encoding. So when it thinks the source and execution encodings are the same, it performs no conversion, and your source bytes, which are actually UTF-8, are used directly for the string literal, producing a UTF-8 encoded string literal. (Note that this breaks the conversion done for wide character and wide string literals.)
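To check whether the trick worked, you can dump the bytes that actually ended up in the literal. This is only a sketch; hex_bytes is a made-up helper name. A UTF-8 encoded 'Д' shows up as d0 94, whereas a literal that was converted to a locale encoding without Cyrillic shows 3f (the '?' character).

```cpp
#include <cstddef>
#include <string>

// Return the bytes of `s` as lowercase two-digit hex values separated by spaces.
std::string hex_bytes(const std::string& s)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

Printing hex_bytes(s1) right after the string is initialized tells you what the compiler stored, before any iterator or stream gets involved.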

Hex escapes

Hex escape codes let you manually insert code units (bytes in this case) of any value into a string literal. You can manually determine the UTF-8 encoding you want and then insert those values into the string literal.

std::string s1 = "\xd0\x94\xd0\xbe\xd0\xb1\xd1\x80\xd1\x8b\xd0\xb9 \xd0\xb4\xd0\xb5\xd0\xbd\xd1\x8c";
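One thing to watch for with hex escapes: an escape consumes every hex digit that follows it, so a letter in the range a-f or a digit directly after an escape gets swallowed. A small sketch of the usual workaround, splitting the literal into adjacent pieces that the compiler concatenates (the function name is made up for illustration):

```cpp
#include <string>

// "Д" (U+0414, UTF-8 bytes d0 94) followed by the ASCII text "abc".
// Writing "\xd0\x94abc" would make the compiler read "\x94abc" as one
// oversized escape, so the literal is split into two adjacent pieces.
std::string cyrillic_de_then_abc()
{
    return "\xd0\x94" "abc";
}
```

The literal above does not hit this problem because every escape is followed by either another backslash or a space.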

UTF-8 string literal prefix

C++11 specifies a prefix that creates a UTF-8 string literal regardless of the execution encoding. Visual Studio 2010 does not implement it (Visual Studio 2015 and later do, as noted in the edit above). It looks like:

string s1 = u8"Добрый день";

It requires that the compiler know and use the correct source encoding (and therefore that the source encoding support the desired string). The compiler then converts from the source encoding to UTF-8 instead of to the execution encoding. With a Visual Studio version that supports this feature, you'll probably want to save your source code as 'UTF-8 with signature', because VS depends on the signature to identify UTF-8 source.
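A hedged sketch of a variant that does not depend on the source file's encoding at all, using universal character names inside the u8 literal. The function name is made up. The cast is there because C++20 changed u8 literals to arrays of char8_t, so a plain assignment to std::string no longer compiles in that mode; before C++20 the cast does nothing.

```cpp
#include <string>

// Build a std::string from a u8 literal. The cast is a no-op before C++20
// (where u8"" is an array of char) and is needed from C++20 on (where it is
// an array of char8_t). Universal character names (\uXXXX) keep the result
// independent of how the source file itself is encoded.
std::string greeting_utf8()
{
    return reinterpret_cast<const char*>(
        u8"\u0414\u043e\u0431\u0440\u044b\u0439 \u0434\u0435\u043d\u044c");
}
```

The result is 21 bytes long: ten two-byte Cyrillic letters plus one space.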

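Once you have a UTF-8 string, and assuming the UTF-8 iterator works, your example code should produce the correct 11 code points. Dereferencing the iterator yields a numeric code point, and your loop prints no separator between them, so I think the output text should look like: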

104410861073108810991081321076107710851100

Insert some spaces to make it readable and you can verify that you're getting the right values:

1044 1086 1073 1088 1099 1081 32 1076 1077 1085 1100

Or make it hex and add the Unicode prefix:

U+0414 U+043e U+0431 U+0440 U+044b U+0439 U+0020 U+0434 U+0435 U+043d U+044c
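If you want to reproduce these numbers without the library, here is a rough sketch of what the decoding amounts to. decode_utf8 and join_decimal are made-up names, and this stand-in assumes well-formed input; it is not how UTF8-C++ is implemented and it does none of the error checking the library does.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode well-formed UTF-8 into code points. Hypothetical stand-in for the
// library iterator; malformed input is not detected.
std::vector<unsigned long> decode_utf8(const std::string& s)
{
    std::vector<unsigned long> cps;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        // The lead byte determines the sequence length and the first payload bits.
        if (lead < 0x80)      { len = 1; cp = lead; }
        else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; }
        else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; }
        else                  { len = 4; cp = lead & 0x07; }
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        cps.push_back(cp);
        i += len;
    }
    return cps;
}

// Join code points as decimal numbers separated by single spaces.
std::string join_decimal(const std::vector<unsigned long>& cps)
{
    std::string out;
    for (std::size_t i = 0; i < cps.size(); ++i) {
        if (i != 0) out += ' ';
        out += std::to_string(cps[i]);
    }
    return out;
}
```

Feeding the hex-escaped literal from above through decode_utf8 and join_decimal gives the spaced-out decimal list.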

If you actually want to produce a UTF-8 encoded output file, then you shouldn't be using the UTF-8 iterator anyway. Write the string's bytes directly:

string s1 = "Добрый день";
std::cout << s1;

When the output is redirected to a file then the file will contain the UTF-8 encoded data:

Добрый день
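If you do need to walk the string one character at a time and still emit UTF-8, output the bytes of each code point rather than its numeric value. A minimal sketch, assuming well-formed input; split_code_points is a made-up helper. With UTF8-C++ you could instead re-encode each decoded code point with utf8::append.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split well-formed UTF-8 into one substring per code point, keeping the
// original bytes. Malformed input is not detected.
std::vector<std::string> split_code_points(const std::string& s)
{
    std::vector<std::string> parts;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        // Sequence length follows from the lead byte.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        parts.push_back(s.substr(i, len));
        i += len;
    }
    return parts;
}
```

Streaming each element of the result to std::cout in order writes exactly the same bytes as streaming the whole string.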

I don't know why your actual output contains the extra spaces, but the values being read appear to be:

63 63 63 63 63 63 32 63 63 63 63

63 is the ASCII code for '?' and 32 is the ASCII code for a space, which gives ?????? ????. So you are clearly suffering from VC++'s conversion of the string literal to the system locale encoding.
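As a quick sanity check of that reading, this small sketch turns the eleven values back into text (codes_to_text is a made-up helper, and it assumes every value is below 128):

```cpp
#include <cstddef>
#include <string>

// Turn a list of character codes (all below 128 here) into text.
std::string codes_to_text(const int* codes, std::size_t count)
{
    std::string out;
    for (std::size_t i = 0; i < count; ++i)
        out += static_cast<char>(codes[i]);
    return out;
}
```

Six question marks, a space, and four more question marks match the shape of "Добрый день", where every Cyrillic letter was replaced during the conversion.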


answered Oct 22 '22 20:10

bames53
