
strcmp returning unexpected results

Tags: c, char, strcmp

I thought strcmp was supposed to return a positive number if the first string was larger than the second string. But this program

#include <stdio.h>
#include <string.h>

int main()
{
    char A[] = "A";
    char Aumlaut[] = "Ä";
    printf("%i\n", A[0]);
    printf("%i\n", Aumlaut[0]);
    printf("%i\n", strcmp(A, Aumlaut));
    return 0;
}

prints 65, -61 and -1.

Why? Is there something I'm overlooking?
I thought that maybe saving the file as UTF-8 would influence things, since the Ä consists of 2 bytes there. But saving as an 8-bit encoding and making sure that both strings have length 1 doesn't help; the end result is the same.
What am I doing wrong?

Using GCC 4.3 under 32 bit Linux here, in case that matters.

Mr Lister asked Feb 06 '26 13:02

2 Answers

strcmp and the other string functions aren't actually UTF-aware. On most POSIX machines, char strings are stored as UTF-8 bytes, which makes most things "just work" for reading and writing, and leaves understanding and manipulating the UTF code points to a dedicated library. But the default string.h functions are not locale-sensitive and know nothing about comparing UTF strings. You can look at the source code for strcmp and see for yourself: it's about as naïve an implementation as possible (which also makes it faster than an internationalization-aware compare function).

I just answered this in another question: you need to use a Unicode-aware string library such as IBM's excellent ICU (International Components for Unicode).

Mahmoud Al-Qudsi answered Feb 09 '26 05:02


The strcmp and similar comparison functions treat the bytes in the strings as unsigned char, as specified by the standard in section 7.24.4, point 1 (it was 7.21.4 in C99):

The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.

(emphasis mine).

The reason is probably that this interpretation maintains the ordering between code points in the common encodings, while interpreting them as signed chars doesn't.

Daniel Fischer answered Feb 09 '26 04:02