I have the task of counting the number of perceived characters in an input. The input is a group of ints (we can think of it as an int[]
) which represents Unicode code points.
java.text.BreakIterator.getCharacterInstance() is not allowed. (I mean their formula is allowed and is what I wanted, but weaving through their source code and state tables got me nowhere >.<)
I was wondering what's the correct algorithm to count the number of grapheme-clusters given some code points?
Initially, I'd thought that all I have to do is to combine all occurences of:
U+0300 – U+036F
(combining diacritical marks)
U+1DC0 – U+1DFF
(combining diacritical marks supplement)
U+20D0 – U+20FF
(combining diacritical marks for symbols)
U+FE20 - U+FE2F
(combining half marks)
into the previous non-diacritic-mark.
However I've realised that prior to that operation, I have to first remove all non-characters as well.
This includes:
U+FDD0 - U+FDEF
The last two code points of every plane
But there seems to be more things to do. Unicode.org states we need to include U+200C
(zero-width non joiner) and U+200D
(zero width joiner) as part of the set of continuing characters (source).
Besides that, it talks about a couple more things but the entire topic is treated in an abstract way. For example, what are the code point ranges for spacing combining marks, hangul jamo characters that forms hangul syllables?
Does anyone know the correct algorithm to count the number of grapheme-clusters given an int[]
of code points?
There's not a single canonical method appropriate to all uses, but a good starting point is the Unicode Grapheme Cluster Boundary algorithm on the Unicode.org page you link to. Basically, Unicode provides a database of each code point's grapheme break property, and then describes an algorithm to decide if a grapheme break is allowed between two code points based on their assigned grapheme break properties.
Here's part of an implementation (in C++) I played around with a while ago:
bool BoundaryAllowed(char32_t cp, char32_t cp2) {
// lbp: left break property; rbp: right break property
auto lbp = get_property_for_codepoint(cp),
rbp = get_property_for_codepoint(cp2);
// Do not break between a CR and LF. Otherwise, break before and after
// controls.
if ((CR == lbp && LF == rbp)) {
// The Unicode grapheme boundary algorithm does not handle LFCR new lines
return false;
}
if (Control == lbp || CR == lbp || LF == lbp || Control == rbp || CR == rbp ||
LF == rbp) {
return true;
}
// Do not break Hangul syllable sequences.
if ((L == lbp && (L == rbp || V == rbp || LV == rbp || LVT == rbp)) ||
((LV == lbp || V == lbp) && (V == rbp || T == rbp)) ||
((LVT == lbp || T == lbp) && (T == rbp))) {
return false;
}
// Do not break before extending characters.
if (Extend == rbp) {
return false;
}
// Do not break before SpacingMarks, or after Prepend characters.
if (Prepend == lbp || SpacingMark == rbp) {
return false;
}
return true; // Otherwise, break everywhere.
}
In order to obtain the ranges for the different types of codepoints you'll just have to look at the Unicode Character Database. The file with the grapheme break properties, which describes them in terms of the ranges, is about 1200 lines long: http://www.unicode.org/Public/6.1.0/ucd/auxiliary/
I'm not really sure how much value there is in ignoring non-character code points, but if your use requires it then you'll add that in to your implementation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With