
Finding "actual" characters (graphemes) in a QString

Tags:

unicode

utf-16

qt

Let's say I have a QString that may contain any Unicode characters, and I want to iterate over its characters or count them. By "characters" I mean what the user perceives as such (roughly equivalent to "glyphs"), not simply QChars, which are 16-bit UTF-16 code units. Some "actual" characters are built from several QChars: surrogate pairs, or a base character followed by combining marks. For some combining sequences I might get away with normalizing the string to obtain precomposed characters, but that does not always help.

Have I overlooked a built-in function that splits a QString into "actual" characters?

Or, if I have to parse it myself, is this the right structure (in EBNF), or am I missing something?

character = ((high_surrogate, low_surrogate) | base_character), {combining_mark}

(with base_character being every QChar that is not a surrogate or combining character)
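
To make the grammar above concrete, here is a minimal sketch of a counter that follows it literally, written against a plain std::u16string so that it does not depend on Qt. The helper names (isHighSurrogate, isLowSurrogate, isCombiningMark, countCharacters) are hypothetical, and the combining-mark test is an assumption that only covers a handful of Unicode blocks. Real grapheme cluster rules (Unicode UAX #29) also handle cases this grammar does not, such as Hangul jamo sequences, zero-width-joiner emoji sequences and regional indicator pairs.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; not part of Qt or the original post.
bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Assumption: only a few combining-mark blocks are recognized here.
// A complete check would consult the Unicode general category (Mn, Mc, Me).
bool isCombiningMark(char16_t c) {
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x1AB0 && c <= 0x1AFF)   // Combining Diacritical Marks Extended
        || (c >= 0x1DC0 && c <= 0x1DFF)   // Combining Diacritical Marks Supplement
        || (c >= 0x20D0 && c <= 0x20FF)   // Combining Diacritical Marks for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);  // Combining Half Marks
}

// Counts units of the form:
//   ((high_surrogate, low_surrogate) | base_character), {combining_mark}
// A lone surrogate or a leading combining mark is treated as its own base.
std::size_t countCharacters(const std::u16string& s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        // Consume either a full surrogate pair or a single code unit.
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]))
            i += 2;
        else
            i += 1;
        // Consume any combining marks attached to that base.
        while (i < s.size() && isCombiningMark(s[i]))
            ++i;
        ++count;
    }
    return count;
}
```

With a QString, the same loop could presumably be fed from QString::utf16() or written with QChar::isHighSurrogate(), QChar::isLowSurrogate() and QChar::isMark(), but that variant is not shown here.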

Sebastian Negraszus asked Nov 04 '11 14:11


1 Answer

After more research I found the term for an "actual character": grapheme (more precisely, grapheme cluster). With that term I also found the Qt class for finding grapheme boundaries: QTextBoundaryFinder.

Sebastian Negraszus answered Sep 18 '22 23:09