Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

std::string optimal way to truncate utf-8 at safe place

Tags:

c++

string

utf-8

I have a valid utf-8 encoded string in a std::string. I have limit in bytes. I would like to truncate the string and add ... at MAX_SIZE - 3 - x - where x is that value that will prevent a utf-8 character to be cut.

Is there function that could determine x based on MAX_SIZE without the need to start from the beginning of the string?

like image 360
gsf Avatar asked Feb 11 '16 00:02

gsf


1 Answers

If you have a location in a string, and you want to go backwards to find the start of a UTF-8 character (and therefore a valid place to cut), this is fairly easily done.

You start from the last byte in the sequence. If the top two bits of the last byte are 10, then it is part of a UTF-8 sequence, so keep backing up until the top two bits are not 10 (or until you reach the start).

The way UTF-8 works is that a byte can be one of three things, based on the upper bits of the byte. If the topmost bit is 0, then the byte is an ASCII character, and the next 7 bits are the Unicode Codepoint value itself. If the topmost bit is 10, then the 6 bits that follow are extra bits for a multi-byte sequence. But the beginning of a multibyte sequence is coded with 11 in the top 2 bits.

So if the top bits of a byte are not 10, then it's either an ASCII character or the start of a multibyte sequence. Either way, it's a valid place to cut.

Note however that, while this algorithm will break the string at codepoint boundaries, it ignores Unicode grapheme clusters. This means that combining characters can be culled, away from the base characters that they combine with; accents can be lost from characters, for example. Doing proper grapheme cluster analysis would require having access to the Unicode table that says whether a codepoint is a combining character.

But it will at least be a valid Unicode UTF-8 string. So that's better than most people do ;)


The code would look something like this (in C++14):

auto FindCutPosition(const std::string &str, size_t max_size)
{
  assert(str.size() >= max_size, "Make sure stupidity hasn't happened.");
  assert(str.size() > 3, "Make sure stupidity hasn't happened.");
  max_size -= 3;
  for(size_t pos = max_size; pos > 0; --pos)
  {
    unsigned char byte = static_cast<unsigned char>(str[pos]); //Perfectly valid
    if(byte & 0xC0 != 0x80)
      return pos;
  }

  unsigned char byte = static_cast<unsigned char>(str[0]); //Perfectly valid
  if(byte & 0xC0 != 0x80)
    return 0;

  //If your first byte isn't even a valid UTF-8 starting point, then something terrible has happened.
  throw bad_utf8_encoded_text(...);
}
like image 65
Nicol Bolas Avatar answered Nov 14 '22 02:11

Nicol Bolas