How to get all characters within a certain UTF-8 language group?

Tags:

c#

.net

utf-8

character

fonts

I don't know the exact technical terminology, but UTF-8 as a standard includes characters from certain language groupings, which can be observed in the Windows Character Map with a font like Arial Unicode MS.

Latin
Cyrillic
Greek
Hebrew
Arabic
Devanagari
Gujarati
Kannada
Lao
Hiragana
Currency Symbols
Box Drawings

How do I obtain a list of the characters under each set? This could be an API or just a plain list/DB somewhere on the net. I found the wiki article that lists everything, but not in an iterable form. Any ideas?


asked Mar 18 '13 08:03

Robin Rodricks

1 Answer

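You can get the entire list of Unicode characters from the published UnicodeData.txt, a semicolon-delimited text file in the Unicode Character Database that lists each character with its code point, name and category. Large contiguous ranges, such as the CJK ideographs, appear only as a First/Last pair of lines rather than one line per character.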

Grouping by class

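The third field gives each character's general category as a two-letter abbreviation (for example, Lu for an uppercase letter); the long forms are defined by the Unicode General_Category property. Some groupings built from these categories, named as in the C# identifier grammar: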

letter-character -- classes Lu, Ll, Lt, Lm, Lo, or Nl
combining-character -- classes Mn or Mc
decimal-digit-character -- class Nd
connecting-character -- class Pc
formatting-character -- class Cf
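
As a rough illustration of reading the third field, here is a minimal sketch that groups code points by their category abbreviation. The class name `UnicodeDataSketch`, the method `GroupByCategory` and the embedded sample lines are assumptions made for this example; a real program would read the full UnicodeData.txt file instead.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class UnicodeDataSketch
{
    // A handful of sample lines in the UnicodeData.txt format (hypothetical
    // stand-in for the real file, which has one such line per character).
    public static readonly string[] SampleLines =
    {
        "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
        "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
        "005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;",
        "0660;ARABIC-INDIC DIGIT ZERO;Nd;0;AN;;0;0;0;N;;;;;",
    };

    // Maps each category abbreviation (third field) to the code points carrying it.
    public static Dictionary<string, List<int>> GroupByCategory(IEnumerable<string> lines)
    {
        var groups = new Dictionary<string, List<int>>();
        foreach (var line in lines)
        {
            var fields = line.Split(';');
            if (fields.Length < 3) continue; // skip blank or malformed lines

            int codePoint = int.Parse(fields[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            string category = fields[2];

            if (!groups.TryGetValue(category, out var list))
            {
                list = new List<int>();
                groups[category] = list;
            }
            list.Add(codePoint);
        }
        return groups;
    }

    public static void Main()
    {
        foreach (var pair in GroupByCategory(SampleLines))
        {
            var codePoints = pair.Value.Select(cp => "U+" + cp.ToString("X4"));
            Console.WriteLine(pair.Key + ": " + string.Join(", ", codePoints));
        }
    }
}
```

This sketch treats every line as a single character, so the First/Last range lines in the real file would need to be expanded separately.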

It's even possible to iterate through the characters of a given category using C# LINQ:

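using System;
using System.Globalization;
using System.Linq;

// Every code point except the surrogate range U+D800..U+DFFF, grouped by Unicode category.
var charInfo = Enumerable.Range(0, 0x110000)
                         .Where(x => x < 0x00d800 || x > 0x00dfff)
                         .Select(char.ConvertFromUtf32)
                         .GroupBy(s => char.GetUnicodeCategory(s, 0))
                         .ToDictionary(g => g.Key);

// Print every character in the lowercase letter category (Ll).
foreach (var ch in charInfo[UnicodeCategory.LowercaseLetter])
{
    Console.Write(ch);
}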

Grouping by language

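However, UnicodeData.txt has no explicit language field, so with that file alone you have to parse the first word of each name to group characters by language. This works for most scripts, since nearly every Latin character's name begins with the prefix "Latin", but it is a heuristic: symbol groups such as Currency Symbols share no common prefix (Euro Sign, for example). The Unicode Character Database also publishes Blocks.txt and Scripts.txt, which map code point ranges to block and script names explicitly and are the more reliable source. Examples of the name prefix follow: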

Latin: Latin Capital Letter A
Latin Extended A: Latin Small Letter C With Acute
Latin Extended B: Latin Capital Letter Tone Six
Latin Extended Additional: Latin Capital Letter B With Dot Above
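
The name-prefix idea can be sketched as follows, next to a lookup against block ranges for comparison. This is only an illustration under stated assumptions: the class `ScriptGroupingSketch`, its methods and the embedded sample lines are hypothetical, the UnicodeData.txt lines are abridged to three fields, and the block lines imitate the "start..end; name" layout of the Unicode Character Database file Blocks.txt.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class ScriptGroupingSketch
{
    // Abridged sample lines in the UnicodeData.txt layout: code point; name; category.
    public static readonly string[] SampleUnicodeData =
    {
        "0041;LATIN CAPITAL LETTER A;Lu",
        "0107;LATIN SMALL LETTER C WITH ACUTE;Ll",
        "0416;CYRILLIC CAPITAL LETTER ZHE;Lu",
        "03A9;GREEK CAPITAL LETTER OMEGA;Lu",
        "20AC;EURO SIGN;Sc",
    };

    // Sample lines in the Blocks.txt layout: "start..end; block name".
    public static readonly string[] SampleBlocks =
    {
        "0000..007F; Basic Latin",
        "0400..04FF; Cyrillic",
        "20A0..20CF; Currency Symbols",
        "2500..257F; Box Drawing",
    };

    // Heuristic: use the first word of the character name as the group key.
    public static ILookup<string, int> GroupByNamePrefix(IEnumerable<string> lines)
    {
        return lines
            .Select(line => line.Split(';'))
            .Where(fields => fields.Length >= 2 && fields[1].Length > 0)
            .ToLookup(
                fields => fields[1].Split(' ')[0],
                fields => int.Parse(fields[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture));
    }

    // Explicit alternative: return the name of the block whose range contains the code point.
    public static string BlockOf(int codePoint, IEnumerable<string> blockLines)
    {
        foreach (var line in blockLines)
        {
            // Skip blank lines and '#' comment lines.
            if (string.IsNullOrWhiteSpace(line) || line.TrimStart().StartsWith("#")) continue;

            var parts = line.Split(';');
            var range = parts[0].Split(new[] { ".." }, StringSplitOptions.None);
            int start = int.Parse(range[0].Trim(), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            int end = int.Parse(range[1].Trim(), NumberStyles.HexNumber, CultureInfo.InvariantCulture);

            if (codePoint >= start && codePoint <= end) return parts[1].Trim();
        }
        return "No_Block"; // fallback label chosen for this sketch
    }

    public static void Main()
    {
        foreach (var group in GroupByNamePrefix(SampleUnicodeData))
        {
            var codePoints = group.Select(cp => "U+" + cp.ToString("X4"));
            Console.WriteLine(group.Key + ": " + string.Join(", ", codePoints));
        }
        Console.WriteLine(BlockOf(0x20AC, SampleBlocks));
    }
}
```

With these sample lines the prefix heuristic files the euro sign under "EURO", while the range lookup places it in Currency Symbols, which is why the block ranges tend to be the safer choice for symbol groups.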

answered Oct 19 '22 15:10

Robin Rodricks
