
How to write a std::string to a UTF-8 text file

Tags:

c++

utf-8

People also ask

How do I encode a string in UTF-8?

In Java, you convert a String to UTF-8 with the getBytes() method. getBytes(charsetName) encodes a String into a sequence of bytes and returns a byte array, where charsetName is the charset used to encode the String (for example, "UTF-8").

Is std::string UTF-8?

Both std::string and std::wstring must use UTF encoding to represent Unicode. On macOS specifically, std::string is UTF-8 (8-bit code units), and std::wstring is UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent.

Is std::string Unicode?

Since std::string works with char, it can already hold UTF-8 encoded Unicode text. Note that std::string, like the C string API, will consider the "olé" string to have 4 elements (bytes), not three characters, because é takes two bytes in UTF-8. So you should be cautious when truncating or otherwise manipulating Unicode text, because some byte sequences are invalid in UTF-8; cutting a string in the middle of a multibyte sequence produces one.


The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters.

And, as sbi points out, incrementing the iterator provided by std::string will step forward by byte, not by character, so it can actually point into the middle of a multibyte UTF-8 codepoint. There's no UTF-8-aware iterator provided in the standard library, but there are a few available on the 'Net.
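To illustrate the byte-versus-character distinction, here is a minimal sketch that counts code points by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). The function name count_code_points is hypothetical, and the sketch assumes the string already holds valid UTF-8; it performs no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points in `s`.
// Every code point starts with exactly one byte that is NOT a
// continuation byte (10xxxxxx), so counting those bytes is enough.
// Assumes `s` is valid UTF-8; invalid input gives a meaningless count.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "ol\xC3\xA9" (olé with a precomposed é), size() returns 4 while count_code_points returns 3.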

If you remember that, you can put UTF-8 into std::string, write it to a file, etc. all in the usual way (by which I mean the way you'd use a std::string without UTF-8 inside).

You may want to start your file with a byte order mark (the three bytes EF BB BF) so that other programs will know it is UTF-8. The BOM is optional, and some tools do not expect it.
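As a rough sketch of "the usual way", the following writes the bytes of a std::string to a file unchanged, optionally preceded by the UTF-8 byte order mark. The name write_utf8_file and its parameters are made up for illustration, and the code assumes the string already contains UTF-8; it does not convert from any other encoding.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes `text` to `path` byte for byte.
// Assumes `text` already holds UTF-8; no conversion is performed.
bool write_utf8_file(const std::string& path, const std::string& text,
                     bool with_bom = false) {
    // Binary mode avoids newline translation on platforms that do it.
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        return false;
    }
    if (with_bom) {
        out.write("\xEF\xBB\xBF", 3);  // UTF-8 byte order mark
    }
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
    out.flush();
    return static_cast<bool>(out);  // false if any write failed
}
```

Calling write_utf8_file("out.txt", str, true) would produce a file that starts with the BOM followed by the string's bytes.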


There is a nice, tiny library for working with UTF-8 in C++: utfcpp


libiconv is a great library for all our encoding and decoding needs.

If you are using Windows, you can use WideCharToMultiByte and specify CP_UTF8 as the code page to convert wide (UTF-16) strings to UTF-8.


What is the easiest and simplest way to do so?

The most intuitive, and thus easiest, way to handle UTF-8 in C++ is surely to use a drop-in replacement for std::string. As the internet still lacks one, I implemented the functionality myself:

tinyutf8 (EDIT: now Github).

This library provides a very lightweight drop-in replacement for std::string (or std::u32string if you will, because you iterate over code points rather than chars). It strikes a balance between fast access and small memory consumption, while being very robust. This robustness to 'invalid' UTF-8 sequences makes it (nearly completely) compatible with ANSI (0-255).

Hope this helps!


If by "simple" you mean ASCII, there is no need to do any encoding, since characters with an ASCII value of 127 or less are the same in UTF-8.
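If you want to confirm that a string really is plain ASCII before relying on this, a small check like the following may help. The function name is_ascii is hypothetical; the sketch simply tests that every byte is 127 or less.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true if every byte of `s` is in the ASCII
// range 0..127, in which case the bytes are already valid UTF-8.
bool is_ascii(const std::string& s) {
    return std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return c < 0x80; });
}
```

An empty string counts as ASCII here, since it contains no bytes above 127.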


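#include <QString>
#include <string>

// Using Qt: convert a wide string to a UTF-8 encoded std::string.
std::wstring text = L"Привет";
QString qstr = QString::fromStdWString(text);
QByteArray byteArray = qstr.toUtf8();  // the UTF-8 encoded bytes
std::string str_std(byteArray.constData(), byteArray.length());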