
Utf-8 in c++: quick & dirty tricks

Tags:

c++

utf-8

I am aware that there have been various questions about utf-8, mainly about libraries for manipulating utf-8 'string'-like objects.

However, I am working on an 'internationalized' project (a website for which I code a c++ backend... don't ask) where, even though we deal with utf-8, we don't actually need such libraries. Most of the time the plain std::string methods or STL algorithms are sufficient for our needs, and indeed this is the goal of using utf-8 in the first place.

So, what I am looking for here is a compilation of the "Quick & Dirty" tricks that you know of for utf-8 stored as std::string (no const char*; I don't care about c-style code really, I've got better things to do than constantly worry about my buffer size).

For example, here is a "Quick & Dirty" trick to obtain the number of characters (which is useful to know if it will fit in your display box):

#include <algorithm>
#include <cstddef>
#include <string>

// Let's remember that in utf-8 encoding, a character may be
// 1 byte: '0.......'
// 2 bytes: '110.....' '10......'
// 3 bytes: '1110....' '10......' '10......'
// 4 bytes: '11110...' '10......' '10......' '10......'
// Therefore '10......' is not the beginning of a character ;)

const unsigned char mask = 0xC0;
const unsigned char notUtf8Begin = 0x80;

struct Utf8Begin
{
  // True for every byte that is not a continuation byte ('10......')
  bool operator()(char c) const { return (c & mask) != notUtf8Begin; }
};

// Let's count
size_t countUtf8Characters(const std::string& s)
{
  return std::count_if(s.begin(), s.end(), Utf8Begin());
}
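As a quick sanity check of the idea, here is a minimal sketch of the same test written with a lambda. It assumes C++11 and well-formed utf-8, and the name countCodePoints is purely illustrative. Note that what gets counted is code points, so a letter followed by a combining accent counts as two.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: counts every byte that is not a UTF-8
// continuation byte (10xxxxxx), i.e. one per code point.
// Assumes the input is well-formed UTF-8.
std::size_t countCodePoints(const std::string& s)
{
  return static_cast<std::size_t>(std::count_if(s.begin(), s.end(),
    [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; }));
}
```

For instance, "na\xC3\xAFve" (6 bytes) should count as 5, while "e\xCC\x81" (an 'e' followed by the combining acute accent U+0301) should count as 2 even though it displays as a single glyph.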

In fact I have yet to encounter a use case where I would need anything, other than the number of characters, that std::string or the STL algorithms don't offer for free, since:

  • sorting works as expected
  • no part of a word can be confused as a word or part of another word

I would like to know if you have other comparable tricks, both for counting and for other simple tasks.
I repeat, I know about ICU and Utf8-CPP, but I am not interested in them since I don't need a full-fledged treatment (and in fact I have never needed more than the count of characters).
I also repeat that I am not interested in dealing with char*s; they are old-fashioned.

Matthieu M. asked Sep 30 '09 17:09


2 Answers

Well, this dirty trick will not work. First, what is the value of mask after this:

   const unsigned char mask = 0x11000000;
   const unsigned char notUtf8Begin = 0x10000000;

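Perhaps you are mixing up hexadecimal and binary notation. Neither 0x11000000 nor 0x10000000 fits in an unsigned char, so both constants are truncated to 0; the binary patterns 11000000 and 10000000 are written 0xC0 and 0x80 in hex.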

Second, as you correctly say, in utf-8 encoding a character may be several bytes long. std::count_if will iterate through every byte of a UTF-8 sequence, but what you actually need is to look at the leading byte of each character and skip the rest until the next character begins.

It is not hard to implement a single loop that does the counting and jumps forward, using a simple mask table for leading bytes.

In the end you get the same O(n) cost for checking the characters, and it will work with every UTF-8 string.
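A minimal sketch of such a loop, using a chain of mask tests in place of a lookup table. The names utf8SequenceLength and countBySkipping are hypothetical, and the handling of malformed input (step over one byte and keep going) is an assumption of this sketch, not a full validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the sequence introduced by
// lead byte b, or 0 if b is a continuation byte or cannot start one.
std::size_t utf8SequenceLength(unsigned char b)
{
  if (b < 0x80) return 1;            // 0xxxxxxx
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                          // 10xxxxxx or 11111xxx
}

// Counts characters by reading each lead byte and jumping over the
// rest of its sequence. A stray continuation byte, or a lead byte
// whose sequence would run past the end, is stepped over as a single
// byte so that the loop always advances.
std::size_t countBySkipping(const std::string& s)
{
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < s.size()) {
    std::size_t len = utf8SequenceLength(static_cast<unsigned char>(s[i]));
    if (len == 0 || len > s.size() - i) len = 1;
    i += len;
    ++count;
  }
  return count;
}
```

With this sketch, "\xE2\x82\xAC" (the euro sign, 3 bytes) counts as 1 and "na\xC3\xAFve" counts as 5. On well-formed input the result is the same as counting every byte that does not match 10xxxxxx; the two only differ on malformed input.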

alexkr answered Oct 16 '22 03:10


Sorting UTF-8 as binary will not sort in 'Unicode' order. BOCU-1 would. As was said, your "as expected" is a pretty low bar for non-English content.
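To see what binary sorting does to non-English content, here is a minimal sketch, assuming C++11 or later (where std::string compares bytes as unsigned values) and a hypothetical helper named sortBinary.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sorts UTF-8 strings with std::string's default
// operator<, i.e. by raw byte values, with no language awareness.
std::vector<std::string> sortBinary(std::vector<std::string> words)
{
  std::sort(words.begin(), words.end());
  return words;
}
```

Given "zebra", "\xC3\xA9" "clair" (éclair) and "apple", the result is apple, zebra, éclair: every non-ASCII letter lands after all ASCII letters, whereas a French reader would expect éclair next to the other words starting with 'e'. Language-aware ordering needs a collation library such as ICU.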

Steven R. Loomis answered Oct 16 '22 05:10