How does .NET determine the Unicode category of a character?

I was looking in mscorlib.dll with .NET Reflector and stumbled upon the Char class. I always wondered how methods like Char.IsLetter were implemented. I expected a huge list of tests, but by digging a little I found a really short piece of code that determines the Unicode category. However, this code uses some kind of tables and some bit-shifting voodoo. Can anyone explain to me how this is done, or point me to some resources?

EDIT: Here's the code. It's in System.Globalization.CharUnicodeInfo.

internal static unsafe byte InternalGetCategoryValue(int ch, int offset)
{
    // Level 1: the top 8 bits of the code point pick a 256-character block.
    ushort num = s_pCategoryLevel1Index[ch >> 8];
    // Level 2: the next 4 bits pick one of the block's 16 sub-blocks.
    num = s_pCategoryLevel1Index[num + ((ch >> 4) & 15)];
    // Level 3: the low 4 bits pick the final byte within the 16-character run.
    byte* numPtr = (byte*) (s_pCategoryLevel1Index + num);
    byte num2 = numPtr[ch & 15];
    // Entries in the value table are two bytes wide; 'offset' selects which byte.
    return s_pCategoriesValue[(num2 * 2) + offset];
}
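
As far as I can tell, this is a three-level trie: the top 8 bits of the code unit select a 256-character block, the next 4 bits select a 16-character run inside that block, and the low 4 bits select the character itself. Blocks and runs with identical category data can share the same rows, which is what keeps the table small. Here's a toy sketch of the same idea (my guess at the scheme, using separate arrays instead of the single shared table the real code uses, and with made-up data):

using System;

// A toy three-level lookup in the spirit of InternalGetCategoryValue.
// NOTE: the real charinfo.nlp packs all levels into one shared table and
// its contents are unknown; the arrays and data here are invented purely
// to illustrate the indexing.
static class ToyTrie
{
    // Level 1: one row index per 256-character block (ch >> 8).
    static readonly ushort[] Level1 = new ushort[256];  // every block shares row 0

    // Level 2: 16 entries per row, picked by the middle nibble ((ch >> 4) & 0xF).
    static readonly ushort[] Level2 = new ushort[16];   // every sub-block shares row 0

    // Level 3: 16 category bytes per row, picked by the low nibble (ch & 0xF).
    static readonly byte[] Level3 = { 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0 };

    public static byte GetCategory(char ch)
    {
        int row2 = Level1[ch >> 8];                       // which 256-character block
        int row3 = Level2[row2 * 16 + ((ch >> 4) & 0xF)]; // which 16-character run
        return Level3[row3 * 16 + (ch & 0xF)];            // category byte for the character
    }
}

class ToyTrieDemo
{
    static void Main()
    {
        // '7' sits at low nibble 7, so the made-up tables report 8.
        Console.WriteLine(ToyTrie.GetCategory('7'));
    }
}

Because every row index is 0 here, the sharing is degenerate, but in the real table different blocks can point at the same rows, which is where the compression comes from.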

s_pCategoryLevel1Index is a ushort* and s_pCategoriesValue is a byte*.

Both are created in the CharUnicodeInfo static constructor:

static unsafe CharUnicodeInfo()
{
    s_pDataTable = GlobalizationAssembly.GetGlobalizationResourceBytePtr(typeof(CharUnicodeInfo).Assembly, "charinfo.nlp");
    UnicodeDataHeader* headerPtr = (UnicodeDataHeader*) s_pDataTable;
    s_pCategoryLevel1Index = (ushort*) (s_pDataTable + headerPtr->OffsetToCategoriesIndex);
    s_pCategoriesValue = s_pDataTable + ((byte*) headerPtr->OffsetToCategoriesValue);
    s_pNumericLevel1Index = (ushort*) (s_pDataTable + headerPtr->OffsetToNumbericIndex);
    s_pNumericValues = s_pDataTable + ((byte*) headerPtr->OffsetToNumbericValue);
    s_pDigitValues = (DigitValues*) (s_pDataTable + headerPtr->OffsetToDigitValue);
    nativeInitTable(s_pDataTable);
}
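
In other words, the whole thing lives in a single resource blob: charinfo.nlp begins with a header whose fields are byte offsets into the rest of the file, and each s_p* pointer is simply s_pDataTable plus the corresponding offset.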

Here is the UnicodeDataHeader.

internal struct UnicodeDataHeader
{
    // Fields
    [FieldOffset(40)]
    internal uint OffsetToCategoriesIndex;
    [FieldOffset(0x2c)]
    internal uint OffsetToCategoriesValue;
    [FieldOffset(0x34)]
    internal uint OffsetToDigitValue;
    [FieldOffset(0x30)]
    internal uint OffsetToNumbericIndex;
    [FieldOffset(0x38)]
    internal uint OffsetToNumbericValue;
    [FieldOffset(0)]
    internal char TableName;
    [FieldOffset(0x20)]
    internal ushort version;
}
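
(Presumably the struct also carries [StructLayout(LayoutKind.Explicit)], which [FieldOffset] requires; it just didn't make it into the paste.)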

Note: I hope this doesn't break any license. If so, I'll remove the code.

asked Feb 25 '11 by subb


1 Answer

The basic information is stored in charinfo.nlp, which is embedded in mscorlib.dll as a resource and loaded at runtime. The exact format of the file is probably known only to Microsoft, but suffice it to say that it is, in some fashion, a lookup table.
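
For what it's worth, you never need to touch that table directly: Char.GetUnicodeCategory and CharUnicodeInfo.GetUnicodeCategory are the public entry points that (outside of a Latin-1 fast path, if I remember the source correctly) end up in that lookup:

using System;
using System.Globalization;

class CategoryDemo
{
    static void Main()
    {
        Console.WriteLine(Char.GetUnicodeCategory('A'));                 // UppercaseLetter
        Console.WriteLine(Char.GetUnicodeCategory('5'));                 // DecimalDigitNumber
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u00E9')); // LowercaseLetter ('é')
        Console.WriteLine(Char.IsLetter('\u00E9'));                      // True: IsLetter just tests the category
    }
}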

EDIT

According to MSDN:

This enumeration is based on The Unicode Standard, version 5.0. For more information, see the "UCD File Format" and "General Category Values" subtopics at the Unicode Character Database.
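
That matches the code in the question: the byte InternalGetCategoryValue returns at offset 0 is presumably cast straight to this UnicodeCategory enumeration.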

answered Oct 30 '22 by Chris Haas