I am trying to iterate through a UTF-8 string. The problem, as I understand it, is that UTF-8 characters have variable length, so I can't just iterate char by char; I need some kind of conversion. I am sure modern C++ has a function for this, but I don't know what it is.
#include <iostream>
#include <string>

int main()
{
    std::string text = u8"řabcdě";
    std::cout << text << std::endl; // Prints fine
    // Again fine. So 'ř' is a 2 byte letter?
    std::cout << "First letter is: " << text.at(0) << text.at(1) << std::endl;
    for(auto it = text.begin(); it < text.end(); it++)
    {
        // Obviously wrong. Outputs only ascii part of the text (a, b, c, d) correctly
        std::cout << "Iterating: " << *it << std::endl;
    }
}
Compiled with clang++ -std=c++11 -stdlib=libc++ test.cpp
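To illustrate why the byte-wise loop above breaks on 'ř' and 'ě': in UTF-8 the first byte of each sequence encodes how many bytes the character occupies, so the string can be walked one character at a time by reading that lead byte. Below is a minimal sketch under that assumption, not a definitive implementation. The names `utf8_sequence_length` and `split_utf8` are made up for this example, and the code does not validate continuation bytes or reject overlong encodings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of bytes in the UTF-8 sequence starting with `lead`.
// Returns 0 for a continuation byte (10xxxxxx) or a byte of 0xF8 and above,
// neither of which can start a sequence.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: splits a UTF-8 string into one std::string per character.
std::vector<std::string> split_utf8(const std::string& text)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(text[i]));
        // Malformed or truncated input: emit the single byte and move on.
        if (len == 0 || i + len > text.size()) len = 1;
        out.push_back(text.substr(i, len));
        i += len;
    }
    return out;
}
```

Looping over `split_utf8(text)` and printing each element should then output 'ř' and 'ě' as whole characters, because each element carries both bytes of the two-byte sequence.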
From what I've read, wchar_t and wstring should not be used.
As n.m. suggested, I used std::wstring_convert:
#include <codecvt>
#include <locale>
#include <iostream>
#include <string>

int main()
{
    std::u32string input = U"řabcdě";
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    for(char32_t c : input)
    {
        std::cout << converter.to_bytes(c) << std::endl;
    }
}
Perhaps I should have specified more clearly in the question that I wanted to know whether this is possible in C++11 without any third-party libraries such as ICU or UTF8-CPP.
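For the no-third-party-library case, one possible approach is to decode the UTF-8 bytes into code points by hand, which also avoids std::wstring_convert (deprecated since C++17). The following is a rough sketch only. The function name `decode_utf8` is hypothetical, and replacing malformed input with U+FFFD is an assumption made for this example. Overlong encodings and surrogate code points are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into code points using
// only the standard library. Malformed or truncated sequences become U+FFFD.
std::u32string decode_utf8(const std::string& text)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte gives the sequence length and the top payload bits.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        bool ok = len != 0 && i + len <= text.size();
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char next = static_cast<unsigned char>(text[i + k]);
            if ((next & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (next & 0x3F);
        }
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(0xFFFD); i += 1; } // resynchronise one byte later
    }
    return out;
}
```

With this, `for (char32_t c : decode_utf8(text))` visits one code point per iteration. Note that a code point is still not always what a reader perceives as a character (combining marks are separate code points), so this only addresses the variable-length encoding.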