Iterating through a UTF-8 string in C++11

I am trying to iterate through a UTF-8 string. As I understand it, the problem is that UTF-8 characters have variable length, so I can't just iterate char by char; I need some kind of conversion. I am sure there is a function for this in modern C++, but I don't know what it is.

#include <iostream>
#include <string>

int main()
{
  std::string text = u8"řabcdě";
  std::cout << text << std::endl; // Prints fine
  std::cout << "First letter is: " << text.at(0) << text.at(1) << std::endl; // Again fine. So 'ř' is a 2 byte letter?

  for(auto it = text.begin(); it < text.end(); it++)
  {
    // Obviously wrong. Only the ASCII part of the text (a, b, c, d) is output correctly
    std::cout << "Iterating: " << *it << std::endl; 
  }
}

Compiled with clang++ -std=c++11 -stdlib=libc++ test.cpp

From what I've read, wchar_t and wstring should not be used.

Jan Šimek asked Sep 27 '14 11:09


1 Answer

As n.m. suggested, I used std::wstring_convert:

#include <codecvt>
#include <locale>
#include <iostream>
#include <string>

int main()
{
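  std::u32string input = U"řabcdě"; // UTF-32: one char32_t per code point

  // Converts between UTF-32 code points and UTF-8 byte sequences
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;

  for(char32_t c : input)
  {
    // Encode each code point back to UTF-8 so it can be printed
    std::cout << converter.to_bytes(c) << std::endl;
  }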
}
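
The snippet above starts from a UTF-32 literal, while the question starts from a UTF-8 std::string. A minimal sketch of bridging the two with the same converter type is shown here, using from_bytes to decode and to_bytes to encode. The helper names decode_utf8 and encode_utf8 are hypothetical and chosen only for illustration. Note that std::wstring_convert and std::codecvt_utf8 were deprecated in C++17, so this is best read as a C++11/C++14 approach.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u32string decode_utf8(const std::string& utf8)
{
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
  return converter.from_bytes(utf8);
}

// Hypothetical helper: encode one code point back into its UTF-8 bytes.
std::string encode_utf8(char32_t code_point)
{
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
  return converter.to_bytes(code_point);
}
```

With these, the loop from the question could become `for (char32_t c : decode_utf8(text)) std::cout << encode_utf8(c) << std::endl;`. Each iteration yields one code point, which is not always one user-perceived character: a letter followed by a combining mark is still two code points.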

Perhaps I should have stated more clearly in the question that I wanted to know whether this is possible in C++11 without third-party libraries such as ICU or UTF8-CPP.
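
For completeness, iterating without any conversion facet is also possible, because the lead byte of each UTF-8 sequence encodes its length. The following is only a sketch under the assumption that the input is mostly well-formed UTF-8: it does not validate continuation bytes or reject overlong encodings. The names utf8_sequence_length and split_utf8 are hypothetical, not standard library functions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: byte length of the UTF-8 sequence starting at `lead`.
// Unexpected bytes (stray continuation bytes, invalid leads) count as length 1
// so that the caller's loop always advances.
std::size_t utf8_sequence_length(unsigned char lead)
{
  if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
  return 1;
}

// Hypothetical helper: split a UTF-8 string into one substring per code point.
std::vector<std::string> split_utf8(const std::string& text)
{
  std::vector<std::string> parts;
  for (std::size_t i = 0; i < text.size(); )
  {
    std::size_t len = utf8_sequence_length(static_cast<unsigned char>(text[i]));
    if (len > text.size() - i) len = text.size() - i; // truncated final sequence
    parts.push_back(text.substr(i, len));
    i += len;
  }
  return parts;
}
```

Each element of the returned vector can be written to std::cout directly, since it is still a UTF-8 encoded std::string. The cast to unsigned char matters because plain char may be signed, in which case bytes above 0x7F would be negative.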

Jan Šimek answered May 27 '23 22:05