I am writing a text-processing Windows app in C#. The app processes many plain text files to count characters, words, etc. To do this, the app iterates over the characters in each file. I am finding that some text files represent accented letters such as Γ‘
by using the Unicode character U+00E1 (small letter A with acute) while other use a simple unaccented a
(U+0061, small letter A) followed by a U+0301 (combining acute accent). There's no visual difference in how the text is rendered on screen by Notepad or other editors I've used, but the underlying character stream is obviously different.
I would like to detect and treat these two situations in the same way. In other words, I would like my app to combine a letter followed by a combining codepoint into the equivalent self-contained character. For example, I'd like to combine the sequence U+0061 U+0301 into U+00E1. As far as I know, there is no simple algorithm to do this, other than a large and error-prone lookup table for all the possible combinations of plain letters and combining characters.
I there a simpler and more direct algorithm to perform this combination?
You're referring to Unicode normalization forms. That page goes into some interesting detail, but the gist is that representing e.g. accented letters as a single codepoint (e.g. Γ‘
as U+00E1) is Normalization Form C, or NFC, and as separate codepoints (e.g. Γ‘
as U+0061 U+0301) is NFD.
Section 3.11 of the Unicode specification goes into the gory details of how to implement it, with some extra details here.
Luckily, you don't need to implement this yourself: string.Normalize()
already exists.
"\u00E1".Normalize(NormalizationForm.FormD); // \u0061\u0301
"\u0061\u0301".Normalize(NormalizationForm.FormC); // \u00E1
That said, we've only just scratched the surface of what a "character" is. A good illustration of this uses emoji, but it applies to various scripts as well: there are modern scripts where normal characters are comprise of two codepoints, and there is no single combined codepoint available. This pops up in e.g. Tamil and Thai, as well as some eastern European langauges (IIRC).
My favourite example is π©π½βπ, or "Woman Firefighter: Medium Skin Tone". Want to guess how that's encoded? That's right, 4 different code points: U+1F469 U+1F3FD U+200D U+1F692.
So you take a woman, add a brown skin tone, glue her to a fire engine, and you get a woman brown-skinned firefighter.
(Just for fun, try pasting π©π½βπ into various editors and then using backspace on it. If it's rendered properly, some editors turn it into π©π½ and then π© and then delete it, while others skip various parts. However, you select it as a single character. This mirrors how editing complex characters works in some scripts).
(Another fun nugget are the flag emoji. Unicode defines "Regional Indicator Symbol Letters A-Z" (U+1F1E6 through U+1F1FF), and a flag is encoded as the country's ISO 3166-1 alpha-2 letter country code using these indicator symbols. So πΊπΈ is πΊ followed by πΈ. Paste the πΈ after the πΊ and a flag appears!)
Of course, if you're iterating over this codepoint-by-codepoint you're going to visit U+1F469 U+1F3FD U+200D U+1F692 individually, which probably isn't what you want.
If you're iterating this char-by-char you're going to do even worse due to surrogate pairs: those codepoints such as U+1F469 are simply too large to represent using a single 16-bit char, so we need to use two of them. This means that if you try to iterate over U+1F469, you'll actually find you've got two chars: 0xD83D (the high surrogate) and 0xDC69 (the low surrogate).
Instead, we need to introduce extended grapheme clusters, which represent what you'd traditionally think of as a single character. Again there's a bunch of complexity if you want to do this yourself, and again someone's helpfully done it for you: StringInfo.GetTextElementEnumerator
. Note that this was a bit buggy pre-.NET 5, and didn't properly handle all EGCs.
In .NET 5, however:
// Number of chars, as 3 of the codepoints need to use surrogate pairs when
// encoded with UTF-16
"π©π½βπ".Length; // 7
// Number of Unicode codepoints
"π©π½βπ".EnumerateRunes().Count(); // 4
// Number of extended grapheme clusters
GetTextElements("π©π½βπ").Count(); // 1
public static IEnumerable<string> GetTextElements(string s)
{
TextElementEnumerator charEnum = StringInfo.GetTextElementEnumerator(s);
while (charEnum.MoveNext())
{
yield return charEnum.GetTextElement();
}
}
I've used emoji as an understandable example here, but these issues also crop up in modern scripts, and people working with text need to be aware of them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With