
Algorithm to check for combining characters in Unicode

Tags: unicode

I intend to normalize to Form C, then divide into "display units", basically a glyph plus all following combining characters. For now, I'm just looking to handle the Latin-based scripts.
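The plan above can be sketched roughly as follows. This is a minimal illustration, not a full solution: `split_display_units` and `is_mark` are hypothetical names, and the default test (general category starting with "M") is an assumption standing in for whatever combining check is finally chosen. Complete grapheme segmentation is specified in Unicode Standard Annex #29 and covers more cases than this.

```python
import unicodedata


def is_mark(ch):
    # Assumption: treat general categories Mn, Mc and Me as "combining".
    return unicodedata.category(ch).startswith("M")


def split_display_units(text, is_combining=is_mark):
    """Normalize to NFC, then group each base character with the
    combining characters that directly follow it."""
    units = []
    for ch in unicodedata.normalize("NFC", text):
        if units and is_combining(ch):
            units[-1] += ch      # attach the mark to the current unit
        else:
            units.append(ch)     # start a new display unit
    return units


if __name__ == "__main__":
    # "e" + acute composes under NFC; "q" + acute has no precomposed form.
    for unit in split_display_units("e\u0301q\u0301"):
        print([f"U+{ord(c):04X}" for c in unit])
```

Passing the predicate as a parameter keeps the splitting logic independent of how "combining" ends up being defined.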

To determine if a code point is a combining character, is it enough to check that it is within these ranges?

  • Combining Diacritical Marks (0300–036F)
  • Combining Diacritical Marks Supplement (1DC0–1DFF)
  • Combining Diacritical Marks for Symbols (20D0–20FF)
  • Combining Half Marks (FE20–FE2F)
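For reference, the range check described in the list above could look something like this. The names `COMBINING_BLOCKS` and `in_combining_block` are made up for illustration, and the test is pure block membership, so it would also accept code points in these blocks that are unassigned in a given Unicode version.

```python
# The four blocks listed in the question, as inclusive (first, last) pairs.
COMBINING_BLOCKS = (
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1DC0, 0x1DFF),  # Combining Diacritical Marks Supplement
    (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
    (0xFE20, 0xFE2F),  # Combining Half Marks
)


def in_combining_block(ch):
    """Return True if the single character ch lies in one of the blocks."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in COMBINING_BLOCKS)


if __name__ == "__main__":
    for ch in ("\u0301", "a", "\u0483"):
        print(f"U+{ord(ch):04X}", in_combining_block(ch))
```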

Arabic, Hebrew and various Indian scripts pending...

Yimin Rong asked Jun 11 '13 19:06




1 Answer

These are all the ranges of Unicode code points whose names contain the word 'combining' (e.g. U+0301 COMBINING ACUTE ACCENT):

300-36F
483-489
7EB-7F3
135F-135F
1A7F-1A7F
1B6B-1B73
1DC0-1DE6
1DFD-1DFF
20D0-20F0
2CEF-2CF1
2DE0-2DFF
3099-309A
A66F-A672
A67C-A67D
A6F0-A6F1
A8E0-A8F1
FE20-FE26
101FD-101FD
1D165-1D169
1D16D-1D172
1D17B-1D182
1D185-1D18B
1D1AA-1D1AD
1D242-1D244

I compiled this list with a Python script, making use of the unicodedata module. I don't know what version of Unicode this is exactly, but I think it's reasonably up to date.
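The original script isn't shown, so the following is only a guess at how such a list might be compiled with `unicodedata`. The function name `combining_name_ranges` is hypothetical. Printing `unicodedata.unidata_version` shows which Unicode version the installed Python ships with, so the output may differ from the list above.

```python
import sys
import unicodedata


def combining_name_ranges():
    """Collect inclusive ranges of code points whose character name
    contains the word COMBINING."""
    ranges = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (unassigned, surrogates, ...) yield "".
        if "COMBINING" not in unicodedata.name(chr(cp), ""):
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return [(first, last) for first, last in ranges]


def main():
    print("Unicode version:", unicodedata.unidata_version)
    for first, last in combining_name_ranges():
        print(f"{first:X}-{last:X}")


if __name__ == "__main__":
    main()
```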

However, I don't know whether handling characters that are 'combining' in the strict sense is all you need, as Unicode also has 'modifier letters' and the like.
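To illustrate the distinction, here is a small sketch that prints the general category and canonical combining class of a few characters. The helper `describe` is a made-up name, and using the general category to separate marks from modifier letters is an approach assumed here, not something the list above was built on.

```python
import unicodedata


def describe(ch):
    """Return (general category, canonical combining class, name)."""
    return (
        unicodedata.category(ch),
        unicodedata.combining(ch),
        unicodedata.name(ch),
    )


if __name__ == "__main__":
    samples = (
        "\u0301",  # COMBINING ACUTE ACCENT
        "\u20dd",  # COMBINING ENCLOSING CIRCLE
        "\u02b0",  # MODIFIER LETTER SMALL H
    )
    for ch in samples:
        print(f"U+{ord(ch):04X}", *describe(ch))
```

Combining marks fall in the categories Mn, Mc or Me, while modifier letters are category Lm and stand on their own. Note that `unicodedata.combining()` returns 0 for some marks (enclosing marks, for example), so a nonzero combining class on its own would miss them.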

lenz answered Jan 02 '23 05:01