Best Type for UTF-8 data?

Tags: c++, unicode, utf-8

What is the best type, in C++, for storing a UTF-8 string? I'd like to avoid rolling my own class if possible.

My original thought was std::string. However, this uses char as the underlying type, and whether char is signed or unsigned is implementation-defined. On my system, it's signed. UTF-8 code units, however, are unsigned octets, which seems to indicate that char is the wrong type.
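To make the signedness concern concrete, here is a minimal sketch of how one might check it and still read the bytes as octets. The names `char_is_signed`, `to_octet`, and `octet_example` are made up for illustration; only `std::numeric_limits` and `std::string` come from the standard library.

```cpp
#include <cassert>
#include <limits>
#include <string>

// True when plain char is a signed type on this implementation.
constexpr bool char_is_signed = std::numeric_limits<char>::is_signed;

// Hypothetical helper: view a char taken from a UTF-8 string as an
// unsigned octet in the range 0..255, whatever the signedness of char.
constexpr unsigned char to_octet(char c) {
    return static_cast<unsigned char>(c);
}

// Illustration only: U+00E9 (e with acute accent) is the two bytes C3 A9.
inline void octet_example() {
    const std::string e_acute = "\xC3\xA9";
    assert(e_acute.size() == 2);
    assert(to_octet(e_acute[0]) == 0xC3);
    assert(to_octet(e_acute[1]) == 0xA9);
    // Comparing the raw char is where signedness shows up: the lead
    // byte is negative exactly when char is signed.
    assert((e_acute[0] < 0) == char_is_signed);
}
```

The bytes stored in the string are the same either way; the cast only matters at the point where a byte is compared or used as a number.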

This leads us to std::basic_string<unsigned char> - which seems to fit the bill: unsigned, 8-bit (or larger) chars.

However, most things seem to use char. glib, for example, uses char. C++'s ostreams use char.

Thoughts?

Thanatos asked Sep 29 '09 02:09


1 Answer

I'd just use std::string. That is consistent with the UTF-8 ideal of treating the data just as you would a null-terminated ASCII string unless you actually need its Unicode-ness.
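As a rough sketch of what that looks like in practice: the string is stored and passed around as plain bytes, and only code that cares about Unicode looks inside. The function `count_code_points` below is a hypothetical helper, not a standard library call, and it assumes the input is already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold
// valid UTF-8. Every byte that is not a continuation byte (bit pattern
// 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        // Cast first so the mask works whether char is signed or not.
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Illustration only: "naive" with i-diaeresis (U+00EF, bytes C3 AF).
inline void count_example() {
    const std::string s = std::string("na") + "\xC3\xAF" + "ve";
    assert(s.size() == 6);              // size() counts bytes
    assert(count_code_points(s) == 5);  // five code points
}
```

Copying, concatenating, and writing the string to an ostream all work unchanged, since none of them need to know where one code point ends and the next begins.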

I also like GTKmm's Glib::ustring, but that only works if you're writing a GTKmm (or at least Glibmm) application.

Michael Ekstrand answered Sep 19 '22 10:09