
Counting Unicode characters in C++

Tags:

c++

unicode

How do you count Unicode characters in a UTF-8 file in C++? Perhaps someone would be so kind as to show me a "stand-alone" method or, alternatively, a short example using ICU (http://icu-project.org/index.html).

EDIT: An important caveat is that I need to build counts for each character, so I'm not counting the total number of characters but the number of occurrences of each character in a set.

Dervin Thunk asked Nov 29 '22 18:11


2 Answers

In UTF-8, every non-leading (continuation) byte has its top two bits set to 10, so to count code points you can simply ignore all such bytes. If you don't mind extra complexity, you can also use the bit pattern of each leading byte to skip ahead across the continuation bytes that follow it, but in practice that's unlikely to make much difference except for short strings, because you'll typically be close to the memory bandwidth limit anyway.
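The skip-ahead variant mentioned above might look roughly like the sketch below. The function names `utf8_sequence_length` and `count_code_points_skipping` are made up for illustration, and the code assumes well-formed UTF-8: it does not validate continuation bytes, and any byte that is not a valid leading byte is simply counted as one unit so the loop keeps moving.

```cpp
#include <cstddef>
#include <string_view>

// Length in bytes of the sequence introduced by a leading byte.
// Assumes well-formed UTF-8; anything that is not a valid leading byte
// is treated as length 1 so the caller always makes progress.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: single byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // stray continuation or invalid byte
}

// Counts code points by jumping from one leading byte to the next
// instead of inspecting every byte.
inline std::size_t count_code_points_skipping(std::string_view s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ) {
        i += utf8_sequence_length(static_cast<unsigned char>(s[i]));
        ++count;
    }
    return count;
}
```

For well-formed input this gives the same total as ignoring continuation bytes; the two approaches only differ on malformed data.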

Edit: I originally misread your question as simply asking how to count the length of a string of characters encoded in UTF-8. If you want to count character frequencies, you probably want to convert the text to UTF-32/UCS-4 and then use some sort of sparse array (a map keyed by code point, for instance) to hold the counts.
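As a rough, stand-alone sketch of that idea, the function below decodes UTF-8 into 32-bit code points by hand and tallies them in a `std::map`, which plays the role of the sparse array. The name `count_code_point_frequencies` is an assumption for illustration, and so is the error handling: the input is assumed to be well formed, stray bytes are skipped, and overlong or truncated sequences are not rejected. It counts code points, not normalized characters.

```cpp
#include <cstddef>
#include <map>
#include <string_view>

// Counts how often each code point occurs in (assumed well-formed) UTF-8.
std::map<char32_t, std::size_t> count_code_point_frequencies(std::string_view utf8) {
    std::map<char32_t, std::size_t> freq;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { ++i; continue; }  // stray byte: skipped in this sketch
        ++i;
        // Fold in the low six bits of each continuation byte.
        for (; extra > 0 && i < utf8.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        ++freq[cp];
    }
    return freq;
}
```

To process a file, one option is to read it into a `std::string` and pass that in; restricting the result to a particular set of characters is then just a matter of looking up those keys in the returned map.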

The hard part is deciding whether you are counting code points or characters. For example, consider the character "À", the "Latin capital letter A with grave". There are at least two different ways to produce this character. You can use code point U+00C0, which encodes the whole thing in a single code point, or you can use code point U+0041 (Latin capital letter A) followed by code point U+0300 (Combining grave accent).
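To make the difference concrete, here is a small sketch with both spellings of "À" written out as UTF-8 bytes. The helper `count_code_points` and the variable names are illustrative only; the helper just counts every byte that is not a continuation byte.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Counts code points by skipping continuation bytes (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    return static_cast<std::size_t>(std::count_if(
        utf8.begin(), utf8.end(),
        [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; }));
}

// The two spellings of "À" as UTF-8 bytes.
const std::string precomposed = "\xC3\x80";   // U+00C0
const std::string decomposed  = "A\xCC\x80";  // U+0041 followed by U+0300
```

Here `count_code_points(precomposed)` is 1 and `count_code_points(decomposed)` is 2, and the two strings compare unequal byte for byte, so a naive frequency table would file the same visible character under different keys.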

Normalizing (with respect to Unicode) means turning all such characters into the same form. You can either combine them into a single code point wherever possible (the composed form, NFC), or split them all into separate code points (the decomposed form, NFD). For your purposes, it's probably easier to combine them into a single code point whenever possible. Writing this on your own probably isn't very practical; I'd use the normalizer API from the ICU project.

Jerry Coffin answered Dec 04 '22 09:12


If you know the UTF-8 sequence is well formed, it's quite easy. Count each byte that starts with a zero bit or with two one bits. The first condition will catch every code point that is represented by a single byte; the second will catch the first byte of each multi-byte sequence.

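// p points to a NUL-terminated UTF-8 string; count starts at 0.
while (*p != 0)
{
    // Single-byte code point (0xxxxxxx) or leading byte (11xxxxxx).
    if ((*p & 0x80) == 0 || (*p & 0xc0) == 0xc0)
        ++count;
    ++p;
}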

Alternatively, as remarked in the comments, you can simply skip every continuation byte:

while (*p != 0)
{
    // Count everything except continuation bytes (10xxxxxx).
    if ((*p & 0xc0) != 0x80)
        ++count;
    ++p;
}

Or if you want to be super clever and make it a 2-liner:

for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80);

The Wikipedia page for UTF-8 clearly shows the patterns.

Mark Ransom answered Dec 04 '22 09:12