Comparing Unicode Strings in C Returns Different Values Than C#

Question

I am trying to write a comparison function in C that takes UTF-8 encoded Unicode strings and compares them with the Windows CompareStringEx() function. I expect it to behave just like .NET's CultureInfo.CompareInfo.Compare().

The function I have written in C works in some cases but not all, and I'm trying to figure out why. Here is a case that fails in C but passes in C#:

CultureInfo cultureInfo = new CultureInfo("en-US");
CompareOptions compareOptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth;

string stringA = "คนอ้วน ๆ";
string stringB = "はじめまして";
// Result is -1, which is expected
int result = cultureInfo.CompareInfo.Compare(stringA, stringB, compareOptions);

And here is what I have written in C. Keep in mind that it is supposed to take UTF-8 encoded strings and use the Windows CompareStringEx() function, so a conversion to UTF-16 is necessary.

// Compare flags for the string comparison
#define COMPARE_STRING_FLAGS (NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH)

int CompareStrings(int lenA, const void *strA, int lenB, const void *strB) 
{
    LCID ENGLISH_LCID = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US), SORT_DEFAULT);
    int compareValue = -1;

    // Get the size of the strings as UTF-16 encoded Unicode strings.
    // Note: Passing 0 as the last parameter makes the MultiByteToWideChar function
    // return the buffer size (in WCHARs) required to convert the given string to UTF-16
    int strAWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, NULL, 0);
    int strBWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, NULL, 0);

    // Allocate buffers to store the converted UTF-16 values
    LPWSTR utf16StrA = (LPWSTR) GlobalAlloc(GMEM_FIXED, strAWStrBufferSize * sizeof(WCHAR));
    LPWSTR utf16StrB = (LPWSTR) GlobalAlloc(GMEM_FIXED, strBWStrBufferSize * sizeof(WCHAR));

    // Convert the UTF-8 strings (SQLite will pass them to us as UTF-8) to the standard
    // Windows WCHAR (UTF-16/UCS-2) encoding so they can be used in the
    // Windows CompareStringEx() function.
    if(strAWStrBufferSize != 0)
    {
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, utf16StrA, strAWStrBufferSize);
    }
    if(strBWStrBufferSize != 0)
    {
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, utf16StrB, strBWStrBufferSize);
    }

    // Compare the strings using the windows compare function.
    // Note: We subtract 1 from the size since we don't want to include the null termination character
    if(NULL != utf16StrA && NULL != utf16StrB)
    {
        compareValue = CompareStringEx(L"en-US", COMPARE_STRING_FLAGS, utf16StrA, strAWStrBufferSize - 1, utf16StrB, strBWStrBufferSize - 1, NULL, NULL, 0);
    }

    // In the Windows CompareStringEx() function, 0 indicates an error, 1 indicates less than, 
    // 2 indicates equal to, 3 indicates greater than so subtract 2 to maintain C convention
    if(compareValue > 0)
    {
        compareValue -= 2;
    }

    return compareValue;
}

Now if I run the following code, I expect the result to be -1 based on the .NET implementation (see above), but I get 1, indicating that the first string is greater than the second:

char strA[50] = "คนอ้วน ๆ";
char strB[50] = "はじめまして";

// Will be 1 when we expect it to be -1
int result = CompareStrings(strlen(strA), strA, strlen(strB), strB);

Any ideas on why the results I'm getting are different? I'm using the same LCID/cultureInfo and compareOptions in both implementations and the conversions are successful as far as I can tell.

FYI: This function will be used as a custom collation in SQLite. That is not relevant to the question, but it explains why the function signature is the way it is.

UPDATE: I also found that running the same code in .NET 4 produces the behavior I saw in the native code, so there is a discrepancy between .NET versions. See my answer below for the reasons behind this.

Jon Skeet · Accepted Answer

Well, your code performs several steps here - it's not clear whether it's the compare step which is failing or not.

As a first step, I would write out - in both the .NET code and the C code - the exact UTF-16 code units which you've got in utf16StrA, utf16StrB, stringA and stringB. I wouldn't be at all surprised to find that there's a problem in the input data you're using in the C code.
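
As a rough illustration of that first step, here is a minimal C sketch that turns a UTF-8 byte string into UTF-16 code units and formats them as hex, so they can be compared by eye with the values printed from the .NET side. The helper names utf8_to_utf16 and format_utf16 are hypothetical, and the hand-rolled decoder is only a stand-in for MultiByteToWideChar so that the sketch builds without <windows.h>. It assumes reasonably well-formed input and does not reject overlong forms or encoded surrogates.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (stand-in for MultiByteToWideChar, not a Windows API):
 * decodes `len` bytes of UTF-8 into UTF-16 code units.
 * Returns the number of code units written, or 0 on malformed input
 * or if `dst` is too small. No terminator is written or counted. */
static size_t utf8_to_utf16(const unsigned char *src, size_t len,
                            uint16_t *dst, size_t cap)
{
    size_t i = 0, n = 0;
    while (i < len) {
        uint32_t cp;
        size_t extra;
        unsigned char b = src[i++];

        /* The lead byte tells us how many continuation bytes follow. */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return 0;

        if (i + extra > len) return 0;          /* truncated sequence */
        for (size_t k = 0; k < extra; k++) {
            if ((src[i] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (uint32_t)(src[i++] & 0x3F);
        }
        if (cp > 0x10FFFF) return 0;            /* outside Unicode range */

        if (cp >= 0x10000) {                    /* needs a surrogate pair */
            if (n + 2 > cap) return 0;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 > cap) return 0;
            dst[n++] = (uint16_t)cp;
        }
    }
    return n;
}

/* Hypothetical helper: writes the code units as space-separated
 * four-digit hex values into `out`, stopping early if space runs out. */
static void format_utf16(const uint16_t *units, size_t n,
                         char *out, size_t cap)
{
    size_t used = 0;
    assert(cap > 0);
    out[0] = '\0';
    /* Each entry needs at most 5 characters plus the terminating NUL. */
    for (size_t i = 0; i < n && used + 6 <= cap; i++) {
        used += (size_t)snprintf(out + used, cap - used,
                                 i ? " %04X" : "%04X", (unsigned)units[i]);
    }
}
```

Calling utf8_to_utf16 on the 18 UTF-8 bytes of "はじめまして" with the length taken from strlen() should give 6 code units, which format_utf16 renders as 306F 3058 3081 307E 3057 3066. Note that the count covers no terminator, because strlen() did not include one. MultiByteToWideChar is documented to behave the same way when given an explicit byte length, so it is worth checking whether the "- 1" passed to CompareStringEx() in the question drops a real character from each string.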

Tags: c, c#, unicode, utf-8, string-comparison

Ian Dallas
