Iterating through Unicode codepoints character by character

Question

I've got a series of Unicode codepoints. What I really need to do is iterate through these codepoints as a series of characters, not a series of codepoints, and determine properties of each individual character, e.g. whether it is a letter.

For example, imagine that I am writing a Unicode-aware textbox, and the user enters a Unicode character that spans more than one codepoint, such as "e with diacritic". I know that this specific character can also be represented as a single codepoint, and can be normalized to that form, but I don't think that's possible in the general case. How could I implement backspace? It obviously can't just erase the last codepoint, because the user might have just entered more than one codepoint.

How can I iterate over a bunch of Unicode codepoints as characters?

Edit: The Break Iterators offered by ICU appear to be pretty much what I need. However, I'm not using ICU, so any references on how to implement my own equivalent functionality would be an accepted answer.

Another edit: It turns out that the Windows API does indeed offer this functionality. MSDN just isn't very good about putting all the string functions in one place. CharNext is the function I'm looking for.

bmargulies · Accepted Answer

Use the ICU library.

http://site.icu-project.org/

For example, this UnicodeString member function returns the character at a particular character offset in a string:

http://icu-project.org/apiref/icu4c/classUnicodeString.html#ae3ffb6e15396dff152cb459ce4008f90
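
For readers who cannot link ICU, as the question's edit mentions, the idea behind a character break iterator can be sketched in plain C++. This is not ICU's API or algorithm: `is_combining_mark`, `next_boundary` and `split_clusters` are hypothetical names, and the combining-mark test is deliberately incomplete. It treats a cluster as one code point plus any combining marks that follow it, which is only a rough approximation of the grapheme cluster rules in Unicode Standard Annex #29.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: a deliberately incomplete test for combining marks.
// It only checks a few Unicode blocks; a real implementation would consult
// the Grapheme_Cluster_Break property from the Unicode Character Database.
bool is_combining_mark(char32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F)   // Combining Diacritical Marks
        || (cp >= 0x1AB0 && cp <= 0x1AFF)   // Combining Diacritical Marks Extended
        || (cp >= 0x1DC0 && cp <= 0x1DFF)   // Combining Diacritical Marks Supplement
        || (cp >= 0x20D0 && cp <= 0x20FF)   // Combining Diacritical Marks for Symbols
        || (cp >= 0xFE20 && cp <= 0xFE2F);  // Combining Half Marks
}

// Returns the index one past the cluster that starts at `pos`:
// one code point plus any combining marks that immediately follow it.
std::size_t next_boundary(const std::u32string& text, std::size_t pos) {
    if (pos >= text.size()) return text.size();
    ++pos;  // consume the first code point of the cluster
    while (pos < text.size() && is_combining_mark(text[pos])) ++pos;
    return pos;
}

// Splits a UTF-32 string into clusters by walking from boundary to boundary.
std::vector<std::u32string> split_clusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (std::size_t pos = 0; pos < text.size();) {
        const std::size_t end = next_boundary(text, pos);
        clusters.push_back(text.substr(pos, end - pos));
        pos = end;
    }
    return clusters;
}
```

With this sketch, `split_clusters(U"e\u0301a")` yields two clusters, the first holding both `e` and U+0301. Sequences such as emoji joined by U+200D or Hangul jamo need the full rule set and are not handled here.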

André Caron · Answer

The UTF8-CPP project has a bunch of clean, easy-to-read, STL-like algorithms to iterate over Unicode strings codepoint by codepoint, character by character, etc. You can look into that for inspiration.
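
To give a feel for what codepoint-by-codepoint iteration over UTF-8 involves, here is a minimal hand-rolled decoder in standard C++. It is not UTF8-CPP's API; `decode_utf8` is a hypothetical name, and the error handling (throwing `std::invalid_argument`) is just one possible choice.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on malformed input.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte determines the sequence length and the first payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }

        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The result is a sequence of code points, not of user-perceived characters: the three bytes `"e\xCC\x81"` decode to two code points, `e` followed by U+0301.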

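Note that the "character by character" approach might not be obvious. One easy approximation is to iterate over a UTF-32 string in normalization form C: UTF-32 encodes every codepoint at a fixed length, and NFC composes combining sequences wherever a precomposed codepoint exists. Not every combining sequence has a precomposed form, though, so this does not guarantee one codepoint per character in the general case.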
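
Because a UTF-32 buffer can still hold a base codepoint followed by combining marks, the backspace from the question has to remove more than the last element. A minimal sketch in standard C++ follows. `is_combining_diacritic` and `erase_last_cluster` are hypothetical names, and the mark test assumes only the Combining Diacritical Marks block (U+0300 to U+036F) matters, which is far from complete.

```cpp
#include <string>

// Hypothetical helper: recognizes only the Combining Diacritical Marks block.
// A fuller version would use the Unicode Character Database.
bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Removes the last user-perceived character from `text`, under the
// simplifying assumption that a character is a base code point followed
// by zero or more combining diacritics.
void erase_last_cluster(std::u32string& text) {
    // Drop any trailing combining marks first...
    while (!text.empty() && is_combining_diacritic(text.back())) {
        text.pop_back();
    }
    // ...then the base code point they were attached to.
    if (!text.empty()) {
        text.pop_back();
    }
}
```

Applied to `U"ab\u0301\u0302"`, one call leaves `U"a"`, since both marks go together with the `b` they modify. Calling it on an empty string does nothing.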

Tags: c++, unicode, character-properties

Puppy
