Are NFC normalization boundaries also extended grapheme cluster boundaries?

Tags: unicode, unicode-normalization

This question is related to text editing. Say you have a piece of text in normalization form NFC, and a cursor that points to an extended grapheme cluster boundary within this text. You want to insert another piece of text at the cursor location, and make sure that the resulting text is also in NFC. You also want to move the cursor to the first grapheme boundary that immediately follows the inserted text.

Now, since concatenating two strings that are both in NFC doesn't necessarily produce a string that is also in NFC, you might have to emend the text around the insertion point. For instance, if you have a string that contains 4 code points like so:

[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E
[2] COMBINING MACRON BELOW
--- Cursor location
[3] LATIN SMALL LETTER A

Suppose you want to insert the two-code-point string {COMBINING ACUTE ACCENT, COMBINING DOT ABOVE} at the cursor location. The result will then be:

[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E WITH ACUTE
[2] COMBINING MACRON BELOW
[3] COMBINING DOT ABOVE
--- Cursor location
[4] LATIN SMALL LETTER A
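
The example above can be reproduced with the standard library alone. This is only a minimal sketch; the names nfc, before, inserted and result are illustrative and not part of the original question:

```python
import unicodedata

def nfc(s):
    # Shorthand for NFC normalization.
    return unicodedata.normalize("NFC", s)

before = "be\u0331"        # b, e, COMBINING MACRON BELOW
inserted = "\u0301\u0307"  # COMBINING ACUTE ACCENT, COMBINING DOT ABOVE
after = "a"

# Each piece is already in NFC when taken on its own...
pieces_are_nfc = nfc(before) == before and nfc(inserted) == inserted

# ...but the plain concatenation is not, so it has to be renormalized.
joined = before + inserted
result = nfc(joined)
concatenation_is_nfc = joined == result

# List the code points of the final text, one per line.
for i, ch in enumerate(result + after):
    print("[%d] %s" % (i, unicodedata.name(ch)))
```

The acute accent composes with the e even though the macron below sits between them, because the macron has a lower canonical combining class and therefore does not block the composition.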

Now my question is: how do you figure out at which offset you should place the cursor after inserting the string, in such a way that the cursor ends up after the inserted string and also on a grapheme boundary? In this particular case, the text that follows the cursor location cannot possibly interact, during normalization, with what precedes. So the following sample Python code would work:

import unicodedata

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    if new_cursor_pos == 0:
        # grapheme_break_after is a function that
        # returns the offset of the first grapheme
        # boundary after the given index
        new_cursor_pos = grapheme_break_after(new_text, 0)
    return new_text, new_cursor_pos

But does this approach necessarily work? To be more explicit: is it necessarily the case that the text that follows a grapheme boundary doesn't interact with what precedes it during normalization, such that NFC(text[:grapheme_break]) + NFC(text[grapheme_break:]) == NFC(text) is always true?

Update

@nwellnhof's excellent analysis below motivated me to investigate things further. So I followed the "When in doubt, use brute force" mantra and wrote a small script that parses grapheme break properties and examines each code point that can appear at the beginning of a grapheme, to test whether it can possibly interact with preceding code points during normalization. Here's the script:

from urllib.request import urlopen
import icu, unicodedata

URL = "http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt"

break_props = {}

with urlopen(URL) as f:
    for line in f:
        line = line.decode()
        p = line.find("#")
        if p >= 0:
            line = line[:p]
        line = line.strip()
        if not line:
            continue
        fields = [x.strip() for x in line.split(";")]
        codes = [int(x, 16) for x in fields[0].split("..")]
        if len(codes) == 2:
            start, end = codes
        else:
            assert(len(codes) == 1)
            start, end = codes[0], codes[0]
        category = fields[1]
        break_props.setdefault(category, []).extend(range(start, end + 1))

# The only code points that can't appear at the beginning of a grapheme
# cluster are those that appear in the following categories. See the regexps
# in UAX #29 Tables 1b and 1c.
to_ignore = set(c for name in ("Extend", "ZWJ", "SpacingMark") for c in break_props[name])

nfc = icu.Normalizer2.getNFCInstance()
for c in range(0x10FFFF + 1):
    if c in to_ignore:
        continue
    if not nfc.hasBoundaryBefore(chr(c)):
        # Pass a default so that unnamed code points don't raise ValueError.
        print("U+%04X %s" % (c, unicodedata.name(chr(c), "<unnamed>")))

Looking at the output, it appears that there are about 40 code points that are grapheme starters but still compose with preceding code points in NFC. Basically, they are non-precomposed Hangul syllables of type V (U+1161..U+1175) and T (U+11A8..U+11C2). Things make sense when you examine the regular expressions in UAX #29, Table 1c together with what the standard says about Jamo composition (section 3.12, p. 147 of version 13 of the standard). The gist of it is that Hangul sequences of the form {L, V} can compose to a Hangul syllable of type LV, and similarly sequences of the form {LV, T} can compose to a syllable of type LVT.
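
The Jamo composition arithmetic can be sketched in a few lines and checked against unicodedata. The function name compose_hangul is illustrative; the constants are the ones I understand section 3.12 of the standard to define, so treat this as a sketch rather than a reference implementation:

```python
import unicodedata

# Constants for Hangul syllable composition (assumed from section 3.12).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_hangul(l, v, t=None):
    """Compose code points L, V and an optional T into one syllable code point."""
    l_index = l - L_BASE
    v_index = v - V_BASE
    # T_BASE is one below the first trailing consonant, so index 0 means "no T".
    t_index = 0 if t is None else t - T_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("not a composable L/V pair")
    if t is not None and not (1 <= t_index < T_COUNT):
        raise ValueError("not a composable trailing consonant")
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index

lv = chr(compose_hangul(0x1100, 0x1161))           # {L, V} -> LV
lvt = chr(compose_hangul(0x1100, 0x1161, 0x11A8))  # {L, V, T} -> LVT

# NFC performs the same compositions, including {LV, T} -> LVT.
nfc_lv = unicodedata.normalize("NFC", "\u1100\u1161")
nfc_lvt = unicodedata.normalize("NFC", lv + "\u11a8")
```

Here nfc_lv and nfc_lvt should each be a single code point, which shows how a V or T can merge into whatever precedes it.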

To sum up, and assuming I'm not mistaken, the above Python code could be corrected as follows:

import unicodedata
import icu  # pip3 install PyICU

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    new_text = unicodedata.normalize("NFC", new_text)
    break_iter = icu.BreakIterator.createCharacterInstance(icu.Locale())
    break_iter.setText(new_text)
    if new_cursor_pos == 0:
        # Move the cursor to the first grapheme boundary > 0.
        new_cursor_pos = break_iter.nextBoundary()
    elif new_cursor_pos > len(new_text):
        new_cursor_pos = len(new_text)
    elif not break_iter.isBoundary(new_cursor_pos):
        # As a side effect, isBoundary() moves the iterator to the first
        # boundary >= the given position.
        new_cursor_pos = break_iter.current()
    return new_text, new_cursor_pos

The (possibly) pointless test new_cursor_pos > len(new_text) is there to catch the case len(NFC(x)) > len(NFC(x + y)). I'm not sure whether this can actually happen with the current Unicode database (more tests would be needed to prove it), but it is theoretically quite possible. If, say, you have a set of three code points A, B and C and two precomposed forms A+B and A+B+C (but not A+C), then you could very well have NFC({A, C} + {B}) = {A+B+C}.
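
As a hedged check of the inequality itself, here is a small sketch using only unicodedata. It relies on my understanding that R WITH DOT BELOW (U+1E5A) and R WITH DOT BELOW AND MACRON (U+1E5C) are precomposed while no precomposed R with macron above exists; the variable names are illustrative:

```python
import unicodedata

def nfc(s):
    # Shorthand for NFC normalization.
    return unicodedata.normalize("NFC", s)

x = "R\u0304"  # A = LATIN CAPITAL LETTER R, C = COMBINING MACRON
y = "\u0323"   # B = COMBINING DOT BELOW

nfc_x = nfc(x)       # R + macron has no precomposed form, so nothing composes
nfc_xy = nfc(x + y)  # the dot below is reordered first, then everything composes

print(len(nfc_x), len(nfc_xy))
print([unicodedata.name(ch) for ch in nfc_xy])
```

This only shows that the inequality can hold for arbitrary strings. Here y begins with a combining mark, so it says nothing about whether the same can happen when y starts at a grapheme cluster boundary, which is the situation in insert().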

If this case doesn't occur in practice (which is very likely, especially with "real" texts), then the above Python code will necessarily locate the first grapheme boundary after the end of the inserted text. Otherwise, it will merely locate some grapheme boundary after the inserted text, but not necessarily the first one. I don't yet see how it could be possible to improve the second case (assuming it isn't merely theoretical), so I think I'll leave my investigation at that for now.

asked Mar 18 '21 14:03

michaelmeyer

1 Answer

As mentioned in my comment, the actual boundaries can differ slightly. But AFAICS, there should be no meaningful interaction. UAX #29 states:

6.1 Normalization

[...] the grapheme cluster boundary specification has the following features:

There is never a break within a sequence of nonspacing marks.

There is never a break between a base character and subsequent nonspacing marks.

This only mentions nonspacing marks. But with extended grapheme clusters (as opposed to legacy ones), I'm pretty sure these statements also apply to "non-starter" spacing marks^[1]. This would cover all normalization non-starters (which must be either nonspacing (Mn) or spacing (Mc) marks). So there's never an extended grapheme cluster boundary before a non-starter^[2] which should give you the guarantee you need.

Note that it's possible to have multiple runs of starters and non-starters ("normalization boundaries") within a single grapheme cluster, for example with U+034F COMBINING GRAPHEME JOINER.
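
A rough way to probe the two claims above with the standard library is to collect the general categories of every code point with a non-zero canonical combining class, and to look up the combining class of U+034F. This is only a sketch: non_starter_categories is an illustrative name, and the result depends on the Unicode version bundled with your Python:

```python
import sys
import unicodedata

def non_starter_categories():
    """Return the set of general categories found among normalization non-starters."""
    cats = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # A non-starter is a code point whose canonical combining class is not 0.
        if unicodedata.combining(ch) != 0:
            cats.add(unicodedata.category(ch))
    return cats

categories = non_starter_categories()
print(sorted(categories))

# COMBINING GRAPHEME JOINER is a mark, yet it is a starter (combining class 0),
# so it can separate two runs of non-starters inside a single grapheme cluster.
cgj = "\u034f"
cgj_is_starter = unicodedata.combining(cgj) == 0
print(unicodedata.category(cgj), cgj_is_starter)
```

If the printed set contains only mark categories, that is consistent with the statement that every non-starter is a nonspacing or spacing mark.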

[1] Some spacing marks are excluded, but these should all be starters.

[2] Except at the start of text.

answered Sep 27 '22 18:09

nwellnhof
