
Combining (some) Unicode nonspacing marks with associated letters for uniform processing

Tags: c#, text, unicode

I am writing a text-processing Windows app in C#. The app processes many plain text files to count characters, words, etc. To do this, the app iterates over the characters in each file. I am finding that some text files represent accented letters such as á by using the Unicode character U+00E1 (small letter A with acute) while others use a simple unaccented a (U+0061, small letter A) followed by U+0301 (combining acute accent). There's no visual difference in how the text is rendered on screen by Notepad or other editors I've used, but the underlying character stream is obviously different.

I would like to detect and treat these two situations in the same way. In other words, I would like my app to combine a letter followed by a combining codepoint into the equivalent self-contained character. For example, I'd like to combine the sequence U+0061 U+0301 into U+00E1. As far as I know, there is no simple algorithm to do this, other than a large and error-prone lookup table for all the possible combinations of plain letters and combining characters.

Is there a simpler and more direct algorithm to perform this combination?

CesarGon asked Oct 20 '25 14:10


1 Answer

You're referring to Unicode normalization forms. That page goes into some interesting detail, but the gist is that representing e.g. accented letters as a single codepoint (e.g. á as U+00E1) is Normalization Form C, or NFC, and as separate codepoints (e.g. á as U+0061 U+0301) is NFD.

Section 3.11 of the Unicode specification goes into the gory details of how to implement it, with some extra details here.

Luckily, you don't need to implement this yourself: string.Normalize() already exists.

"\u00E1".Normalize(NormalizationForm.FormD); // \u0061\u0301
"\u0061\u0301".Normalize(NormalizationForm.FormC); // \u00E1

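Applied to the question's scenario, one option is to normalize each file's text to NFC before iterating, so that both encodings of an accented letter reach the counting code as the same codepoint. The following is a minimal sketch, assuming C# 9 top-level statements; `ToComposed` is a hypothetical helper name, not something the framework provides:

```csharp
using System;
using System.Text;

// Hypothetical helper: returns the text in NFC, skipping the conversion
// when the input is already normalized.
static string ToComposed(string text) =>
    text.IsNormalized(NormalizationForm.FormC)
        ? text
        : text.Normalize(NormalizationForm.FormC);

string precomposed = "caf\u00E9";   // é as a single codepoint (U+00E9)
string decomposed = "cafe\u0301";   // e followed by a combining acute accent

// string == is an ordinal comparison, so the raw strings differ
Console.WriteLine(precomposed == decomposed);                          // False
Console.WriteLine(ToComposed(precomposed) == ToComposed(decomposed));  // True
Console.WriteLine(ToComposed(decomposed).Length);                      // 4
```
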
That said, we've only just scratched the surface of what a "character" is. A good illustration of this uses emoji, but it applies to various scripts as well: there are modern scripts where normal characters comprise two or more codepoints, and there is no single combined codepoint available. This pops up in e.g. Tamil and Thai, as well as some eastern European languages (IIRC).
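
As a small illustration of a sequence that NFC cannot collapse, here is a sketch using the Tamil syllable "ni". The choice of syllable is my own example rather than something taken from the answer; it is written as two codepoints and Unicode defines no precomposed form for it:

```csharp
using System;
using System.Text;

// Tamil "ni": TAMIL LETTER NA (U+0BA8) + TAMIL VOWEL SIGN I (U+0BBF).
// There is no single precomposed codepoint for this syllable.
string ni = "\u0BA8\u0BBF";
string composed = ni.Normalize(NormalizationForm.FormC);

Console.WriteLine(composed.Length);   // 2
Console.WriteLine(composed == ni);    // True: NFC leaves the sequence as-is
```

So normalization alone does not guarantee one codepoint per user-perceived character.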

My favourite example is 👩🏽‍🚒, or "Woman Firefighter: Medium Skin Tone". Want to guess how that's encoded? That's right, 4 different code points: U+1F469 U+1F3FD U+200D U+1F692.

  • U+1F469 is 👩, the Woman emoji.
  • U+1F3FD is "Emoji Modifier Fitzpatrick Type-4", which modifies the previous emoji to give a brown skin tone 👩🏽, rendered as 🏽 when it appears on its own.
  • U+200D is a "zero-width joiner", which is used to glue codepoints together into the same character.
  • U+1F692 is 🚒, the Fire Engine emoji.

So you take a woman, add a brown skin tone, glue her to a fire engine, and you get a brown-skinned woman firefighter.
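
The same assembly can be sketched in code by building the string from its four codepoints. This assumes .NET Core 3.0 or later for `EnumerateRunes` and C# 9 top-level statements; the variable names are illustrative only:

```csharp
using System;
using System.Linq;

// Woman, skin tone modifier, zero-width joiner, fire engine
int[] codepoints = { 0x1F469, 0x1F3FD, 0x200D, 0x1F692 };

// char.ConvertFromUtf32 returns the UTF-16 string for a single codepoint
string firefighter = string.Concat(codepoints.Select(cp => char.ConvertFromUtf32(cp)));

Console.WriteLine(firefighter == "\U0001F469\U0001F3FD\u200D\U0001F692"); // True

// Walk the string codepoint by codepoint to see the parts again
foreach (var rune in firefighter.EnumerateRunes())
{
    Console.WriteLine($"U+{rune.Value:X4}");
}
```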

(Just for fun, try pasting 👩🏽‍🚒 into various editors and then using backspace on it. If it's rendered properly, some editors turn it into 👩🏽 and then 👩 and then delete it, while others skip various parts. However, you select it as a single character. This mirrors how editing complex characters works in some scripts).

(Another fun nugget is the flag emoji. Unicode defines "Regional Indicator Symbol Letters A-Z" (U+1F1E6 through U+1F1FF), and a flag is encoded as the country's ISO 3166-1 alpha-2 country code using these indicator symbols. So 🇺🇸 is 🇺 followed by 🇸. Paste the 🇸 after the 🇺 and a flag appears!)
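
The mapping from country code to flag is simple arithmetic on codepoints. Here is a rough sketch, where `FlagFor` is a hypothetical helper that assumes an upper-case, two-letter ASCII code and does no validation:

```csharp
using System;
using System.Linq;

// Hypothetical helper: maps each letter A-Z onto the matching
// Regional Indicator Symbol Letter, starting at U+1F1E6 for 'A'.
static string FlagFor(string countryCode)
{
    const int regionalIndicatorA = 0x1F1E6;
    return string.Concat(
        countryCode.Select(c => char.ConvertFromUtf32(regionalIndicatorA + (c - 'A'))));
}

string us = FlagFor("US");

Console.WriteLine(us == "\U0001F1FA\U0001F1F8"); // True
Console.WriteLine(us.Length);                     // 4 (two surrogate pairs)
```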

Of course, if you're iterating over this codepoint-by-codepoint you're going to visit U+1F469 U+1F3FD U+200D U+1F692 individually, which probably isn't what you want.

If you're iterating this char-by-char you're going to do even worse due to surrogate pairs: those codepoints such as U+1F469 are simply too large to represent using a single 16-bit char, so we need to use two of them. This means that if you try to iterate over U+1F469, you'll actually find you've got two chars: 0xD83D (the high surrogate) and 0xDC69 (the low surrogate).
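
A short sketch of what that looks like from C#, using only the `char` helpers for surrogates (C# 9 top-level statements assumed):

```csharp
using System;

string woman = "\U0001F469";

Console.WriteLine(woman.Length);                    // 2
Console.WriteLine($"0x{(int)woman[0]:X4}");         // 0xD83D
Console.WriteLine($"0x{(int)woman[1]:X4}");         // 0xDC69
Console.WriteLine(char.IsHighSurrogate(woman[0]));  // True
Console.WriteLine(char.IsLowSurrogate(woman[1]));   // True

// Recombine the pair into the original codepoint
int codepoint = char.ConvertToUtf32(woman[0], woman[1]);
Console.WriteLine($"U+{codepoint:X}");              // U+1F469
```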

Instead, we need to introduce extended grapheme clusters, which represent what you'd traditionally think of as a single character. Again there's a bunch of complexity if you want to do this yourself, and again someone's helpfully done it for you: StringInfo.GetTextElementEnumerator. Note that this was a bit buggy pre-.NET 5, and didn't properly handle all EGCs.

In .NET 5, however:

// Number of chars, as 3 of the codepoints need to use surrogate pairs when
// encoded with UTF-16
"👩🏽‍🚒".Length; // 7

// Number of Unicode codepoints
"👩🏽‍🚒".EnumerateRunes().Count(); // 4

// Number of extended grapheme clusters
GetTextElements("👩🏽‍🚒").Count(); // 1

public static IEnumerable<string> GetTextElements(string s)
{
    TextElementEnumerator charEnum = StringInfo.GetTextElementEnumerator(s);
    while (charEnum.MoveNext())
    {
        yield return charEnum.GetTextElement();
    }
}
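
Tying this back to the original question, the two encodings of á already count as one text element each, with or without normalization. A minimal sketch using `StringInfo.LengthInTextElements`, which reports the same count as enumerating the text elements (variable names are illustrative):

```csharp
using System;
using System.Globalization;

string precomposed = "\u00E1";   // á as one codepoint
string decomposed = "a\u0301";   // a + combining acute accent

Console.WriteLine(decomposed.Length);                                 // 2
Console.WriteLine(new StringInfo(precomposed).LengthInTextElements);  // 1
Console.WriteLine(new StringInfo(decomposed).LengthInTextElements);   // 1
```

The counts agree, but the text elements themselves are still different strings, so a per-character frequency table would probably still want the text normalized first.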

I've used emoji as an understandable example here, but these issues also crop up in modern scripts, and people working with text need to be aware of them.

canton7 answered Oct 22 '25 03:10


