
How to get all characters within a certain UTF-8 language group?

I don't know the exact technical terminology, but UTF-8 (or rather Unicode, the character set it encodes) includes characters from certain language groupings, which can be observed in the Windows Character Map with a font like Arial Unicode MS.

  • Latin
  • Cyrillic
  • Greek
  • Hebrew
  • Arabic
  • Devanagari
  • Gujarati
  • Kannada
  • Lao
  • Hiragana
  • Currency Symbols
  • Box Drawings

How do I obtain a list of the characters under each set? This could be an API or just a plain list/DB somewhere on the net. I found the wiki article that lists everything, but not in an iterable form. Any ideas?

Robin Rodricks asked Mar 18 '13 08:03




1 Answer

You can access the entire list of Unicode characters in the published UnicodeData.txt, a semicolon-delimited file that lists each character with its name and category information.
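As a rough illustration of reading that file, here is a minimal Python sketch. It assumes the usual UnicodeData.txt layout of 15 semicolon-separated fields per line, with the code point, name and general category in the first three. The SAMPLE string holds a handful of lines in that format so the snippet runs without a download, and parse_unicode_data is a hypothetical helper name, not part of any library.

```python
# A few sample lines in the UnicodeData.txt format (15 semicolon-separated fields).
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0391;GREEK CAPITAL LETTER ALPHA;Lu;0;L;;;;;N;;;;03B1;
0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;
"""


def parse_unicode_data(text):
    """Yield (code_point, name, general_category) for each non-empty line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        # Field 0 is the hexadecimal code point, field 1 the name,
        # field 2 the two-letter general category.
        yield int(fields[0], 16), fields[1], fields[2]


records = list(parse_unicode_data(SAMPLE))
for code_point, name, category in records:
    print(f"U+{code_point:04X} {category} {name}")
```

To process the real file, the same function could be fed the contents of a downloaded UnicodeData.txt instead of SAMPLE.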

Grouping by class

The third column specifies the character's general category as a two-letter abbreviation; the long forms are specified here.

  • letter-character -- classes Lu, Ll, Lt, Lm, Lo, or Nl
  • combining-character -- classes Mn or Mc
  • decimal-digit-character -- class Nd
  • connecting-character -- class Pc
  • formatting-character -- class Cf
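The class names in the list above can be turned into a small lookup. The following Python sketch uses the standard unicodedata module to read each character's general category; the CLASSES table and the classify function are illustrative names introduced here, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Class names from the list above, mapped to their general categories.
CLASSES = {
    "letter-character": {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"},
    "combining-character": {"Mn", "Mc"},
    "decimal-digit-character": {"Nd"},
    "connecting-character": {"Pc"},
    "formatting-character": {"Cf"},
}


def classify(ch):
    """Return the class name for a single character, or None if it fits none."""
    category = unicodedata.category(ch)
    for class_name, categories in CLASSES.items():
        if category in categories:
            return class_name
    return None


for ch in ["A", "7", "_", " "]:
    print(repr(ch), unicodedata.category(ch), classify(ch))
```

Characters whose category is not in the table, such as a space (Zs), come back as None.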

It's even possible to iterate through the characters of a certain group using C# LINQ:

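using System;
using System.Globalization;
using System.Linq;

// Every code point from U+0000 to U+10FFFF, skipping the surrogate range,
// grouped by its Unicode general category.
var charInfo = Enumerable.Range(0, 0x110000)
                         .Where(x => x < 0x00d800 || x > 0x00dfff)
                         .Select(char.ConvertFromUtf32)
                         .GroupBy(s => char.GetUnicodeCategory(s, 0))
                         .ToDictionary(g => g.Key);

// Print every lowercase letter (category Ll).
foreach (var ch in charInfo[UnicodeCategory.LowercaseLetter])
{
    Console.Write(ch);
}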

Grouping by language

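However, UnicodeData.txt has no explicit language field, so one approach is to parse the first word of each character's name and group by that. This works as a heuristic because the names of Latin letters begin with the prefix "LATIN" (names in the file are upper case), although symbols that sit in the same blocks, such as MULTIPLICATION SIGN, do not carry the prefix. The Unicode Character Database also publishes Blocks.txt and Scripts.txt, which assign code points to named blocks and scripts explicitly. Examples follow: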

  • Latin: Latin Capital Letter A
  • Latin Extended A: Latin Small Letter C With Acute
  • Latin Extended B: Latin Capital Letter Tone Six
  • Latin Extended Additional: Latin Capital Letter B With Dot Above
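The name-prefix idea can be sketched in Python with the standard unicodedata module, which exposes the same character names as UnicodeData.txt. This is only a heuristic under the assumption that the first word of the name identifies the group; group_by_name_prefix is a hypothetical helper, and the range used here (U+0000 up to, but not including, U+0500) is chosen just to keep the example small.

```python
import unicodedata
from collections import defaultdict


def group_by_name_prefix(start, stop):
    """Group characters in [start, stop) by the first word of their Unicode name."""
    groups = defaultdict(list)
    for code_point in range(start, stop):
        ch = chr(code_point)
        # Unnamed code points (controls, unassigned) yield the empty default.
        name = unicodedata.name(ch, "")
        if name:
            groups[name.split()[0]].append(ch)
    return groups


groups = group_by_name_prefix(0x0000, 0x0500)
print("".join(groups["GREEK"][:10]))
print("".join(groups["CYRILLIC"][:10]))
```

Note that characters without the prefix end up in their own groups: U+00D7 MULTIPLICATION SIGN, for instance, lands under "MULTIPLICATION" rather than "LATIN", even though it lives in the Latin-1 Supplement block.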

Robin Rodricks answered Oct 19 '22 15:10