I was looking at mscorlib.dll with .NET Reflector and stumbled upon the Char class. I had always wondered how methods like Char.IsLetter were implemented. I expected a huge list of tests, but by digging a little bit I found really short code that determines the Unicode category. However, this code uses some kind of tables and some bit-shifting voodoo. Can anyone explain to me how this is done, or point me to some resources?
EDIT: Here's the code. It's in System.Globalization.CharUnicodeInfo.
internal static unsafe byte InternalGetCategoryValue(int ch, int offset)
{
    ushort num = s_pCategoryLevel1Index[ch >> 8];
    num = s_pCategoryLevel1Index[num + ((ch >> 4) & 15)];
    byte* numPtr = (byte*) (s_pCategoryLevel1Index + num);
    byte num2 = numPtr[ch & 15];
    return s_pCategoriesValue[(num2 * 2) + offset];
}
s_pCategoryLevel1Index is a ushort* and s_pCategoriesValue is a byte*. Both are initialized in the CharUnicodeInfo static constructor:
static unsafe CharUnicodeInfo()
{
    s_pDataTable = GlobalizationAssembly.GetGlobalizationResourceBytePtr(typeof(CharUnicodeInfo).Assembly, "charinfo.nlp");
    UnicodeDataHeader* headerPtr = (UnicodeDataHeader*) s_pDataTable;
    s_pCategoryLevel1Index = (ushort*) (s_pDataTable + headerPtr->OffsetToCategoriesIndex);
    s_pCategoriesValue = s_pDataTable + ((byte*) headerPtr->OffsetToCategoriesValue);
    s_pNumericLevel1Index = (ushort*) (s_pDataTable + headerPtr->OffsetToNumbericIndex);
    s_pNumericValues = s_pDataTable + ((byte*) headerPtr->OffsetToNumbericValue);
    s_pDigitValues = (DigitValues*) (s_pDataTable + headerPtr->OffsetToDigitValue);
    nativeInitTable(s_pDataTable);
}
Here is the UnicodeDataHeader.
[StructLayout(LayoutKind.Explicit)] // explicit layout is required for the FieldOffset attributes
internal struct UnicodeDataHeader
{
    // Fields
    [FieldOffset(40)]
    internal uint OffsetToCategoriesIndex;
    [FieldOffset(0x2c)]
    internal uint OffsetToCategoriesValue;
    [FieldOffset(0x34)]
    internal uint OffsetToDigitValue;
    [FieldOffset(0x30)]
    internal uint OffsetToNumbericIndex;
    [FieldOffset(0x38)]
    internal uint OffsetToNumbericValue;
    [FieldOffset(0)]
    internal char TableName;
    [FieldOffset(0x20)]
    internal ushort version;
}
Note: I hope this doesn't break any licence. If so, I'll remove the code.
We can determine the Unicode category of a particular character by using the Character.getType() method in Java. It is a static method of the Character class and returns an integer representing the Unicode general category of the char ch passed to it.
16-bit Unicode Transformation Format (UTF-16) is a character encoding system that uses 16-bit code units to represent Unicode code points. .NET uses UTF-16 to encode the text in a string. A char instance represents a single 16-bit code unit.
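As a small illustration of the code-unit vs. code-point distinction (a minimal C# sketch; the specific code point U+1F600 is just an arbitrary character outside the Basic Multilingual Plane):

using System;

class Utf16Demo
{
    static void Main()
    {
        // U+1F600 is outside the BMP, so it takes two UTF-16 code units (a surrogate pair).
        string s = char.ConvertFromUtf32(0x1F600);
        Console.WriteLine(s.Length);                   // 2 (two char code units, one code point)
        Console.WriteLine(char.IsHighSurrogate(s[0])); // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));  // True
    }
}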
A Unicode general category defines the broad classification of a character, that is, designation as a type of letter, decimal digit, separator, mathematical symbol, punctuation, and so on. This enumeration is based on The Unicode Standard, version 5.0.
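In .NET you normally get at this classification through Char.GetUnicodeCategory or CharUnicodeInfo.GetUnicodeCategory, both of which return a value of that UnicodeCategory enumeration; a minimal sketch:

using System;
using System.Globalization;

class CategoryDemo
{
    static void Main()
    {
        Console.WriteLine(char.GetUnicodeCategory('a'));            // LowercaseLetter
        Console.WriteLine(char.GetUnicodeCategory('5'));            // DecimalDigitNumber
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('€')); // CurrencySymbol

        // Char.IsLetter is essentially a test for the letter categories.
        Console.WriteLine(char.IsLetter('a'));                      // True
    }
}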
The equivalent in C# is the String class. According to MSDN, a String "represents text as a series of Unicode characters". So if you write string str = "a string here";, you have a Unicode string.
The basic information is stored in charinfo.nlp, which is embedded in mscorlib.dll as a resource and loaded at runtime. The specifics of the file format are probably only known to Microsoft, but suffice it to say that it is essentially a lookup table.
EDIT
According to MSDN:
This enumeration is based on The Unicode Standard, version 5.0. For more information, see the "UCD File Format" and "General Category Values" subtopics at the Unicode Character Database.
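As for the bit-shifting itself: the method in the question is a multi-stage (trie-style) table lookup. ch >> 8 picks a 256-code-point block, (ch >> 4) & 0xF picks one of sixteen 16-code-point groups inside that block, and ch & 0xF picks the final entry; blocks and groups with identical contents share the same sub-table, which is what keeps the data small. Here is a minimal, self-contained C# sketch of that idea. The table contents are made up for illustration (only U+0041 'A' gets a non-default value), and the layout is deliberately simplified; the real structure of charinfo.nlp is Microsoft's.

using System;

class TrieLookupSketch
{
    // Stage 1: one entry per 256-code-point block (indexed by ch >> 8).
    static readonly ushort[] level1 = new ushort[256];

    // Stage 2: rows of 16 entries, one per 16-code-point group (indexed by (ch >> 4) & 0xF).
    static readonly ushort[] level2 = new ushort[2 * 16];

    // Stage 3: leaves of 16 category values, one per code point (indexed by ch & 0xF).
    static readonly byte[] level3 = new byte[2 * 16];

    static TrieLookupSketch()
    {
        // Made-up data: route U+0040..U+004F to a leaf where 'A' (U+0041) has category 1.
        level1[0x00] = 1;         // block U+0000..U+00FF gets its own level-2 row (row 1)
        level2[1 * 16 + 0x4] = 1; // group U+0040..U+004F gets its own leaf (leaf 1)
        level3[1 * 16 + 0x1] = 1; // U+0041 ('A') -> pretend category 1 ("uppercase letter")
        // Everything else falls through to row 0 / leaf 0, which stay all zeros (default category).
    }

    static byte GetCategory(char ch)
    {
        int row  = level1[ch >> 8];                       // pick the 256-code-point block
        int leaf = level2[row * 16 + ((ch >> 4) & 0xF)];  // pick the 16-code-point group
        return level3[leaf * 16 + (ch & 0xF)];            // pick the final category value
    }

    static void Main()
    {
        Console.WriteLine(GetCategory('A')); // 1 (our made-up value)
        Console.WriteLine(GetCategory('a')); // 0 (default in this toy table)
    }
}

The (num2 * 2) + offset step in the real code suggests that each leaf entry actually selects a pair of bytes, with offset choosing between the general category and a second per-character property.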