I thought strcmp was supposed to return a positive number if the first string was larger than the second string. But this program
#include <stdio.h>
#include <string.h>
int main()
{
char A[] = "A";
char Aumlaut[] = "Ä";
printf("%i\n", A[0]);
printf("%i\n", Aumlaut[0]);
printf("%i\n", strcmp(A, Aumlaut));
return 0;
}
prints 65, -61 and -1.
Why? Is there something I'm overlooking?
I thought that maybe the fact that I'm saving as UTF-8 would influence things.. You know because the Ä consists of 2 chars there. But saving as an 8-bit encoding and making sure that the strings both have length 1 doesn't help, the end result is the same.
What am I doing wrong?
Using GCC 4.3 under 32 bit Linux here, in case that matters.
strcmp and the other string functions aren't actually utf aware. On most posix machines, C/C++ char is internally utf8, which makes most things "just work" with regards to reading and writing and provide the option of a library understanding and manipulating the utf codepoints. But the default string.h functions are not culture sensitive and do not know anything about comparing utf strings. You can look at the source code for strcmp and see for yourself, it's about as naïve an implementation as possible (which means it's also faster than an internationalization-aware compare function).
I just answered this in another question - you need to use a UTF-aware string library such as IBM's excellent ICU - International Components for Unicode.
The strcmp and similar comparison functions treat the bytes in the strings as unsigned chars, as specified by the standard in section 7.24.4, point 1 (was 7.21.4 in C99)
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
(emphasis mine).
The reason is probably that such an interpretation maintains the ordering between code points in the common encodings, while interpreting them a s signed chars doesn't.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With