Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is the strcasecmp algorithm flawed?

Tags:

c

strcmp

I am trying to reimplement the strcasecmp function in C and I noticed what appears to be an inconsistency in the comparison process.

From man strcmp

The strcmp() function compares the two strings s1 and s2. The locale is not taken into account (for a locale-aware comparison, see strcoll(3)). It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

From man strcasecmp

The strcasecmp() function performs a byte-by-byte comparison of the strings s1 and s2, ignoring the case of the characters. It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

int strcmp(const char *s1, const char *s2);
int strcasecmp(const char *s1, const char *s2);

Given, this information, I don't understand the result of the following code:

#include <stdio.h>
#include <string.h>

int main()
{
    // ASCII values
    // 'A' = 65
    // '_' = 95
    // 'a' = 97

    printf("%i\n", strcmp("A", "_"));
    printf("%i\n", strcmp("a", "_"));
    printf("%i\n", strcasecmp("A", "_"));
    printf("%i\n", strcasecmp("a", "_"));
    return 0;
}

Ouput:

-1  # "A" is less than "_"
1   # "a" is more than "_"
2   # "A" is more than "_" with strcasecmp ???
2   # "a" is more than "_" with strcasecmp

It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.

Can someone explain this behaviour? Shouldn't the first and third lines be identical?

Thank you in advance!

PS:
I am using gcc 9.2.0 on Manjaro.
Also, when I compile with the -fno-builtin flag I get instead:

-30
2
2
2

I guess it's because the program does not use gcc's optimised functions, but the question remains.

like image 774
Haltarys Avatar asked Feb 21 '20 16:02

Haltarys


People also ask

Is Strcasecmp case sensitive?

The strcasecmp() function compares string1 and string2 without sensitivity to case. All alphabetic characters in string1 and string2 are converted to lowercase before comparison.

Is Strcasecmp a standard?

It depends on strcasecmp() specifications (it is not C standard) . Most strcasecmp() do use this tolower() approach for the default locale.

What is the use of Strcasecmp?

The strcasecmp() function compares two strings. Tip: The strcasecmp() function is binary-safe and case-insensitive. Tip: This function is similar to the strncasecmp() function, with the difference that you can specify the number of characters from each string to be used in the comparison with strncasecmp().

Where is Strcasecmp defined?

In programming language C, strcasecmp is a function declared in the strings. h header file (or sometimes in string. h) that compares two strings irrespective of the case of characters.


3 Answers

The behavior is correct.

Per the POSIX str\[n\]casecmp() specification:

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

That is also part of the NOTES section of the Linux man page:

The POSIX.1-2008 standard says of these functions:

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

Why?

As @HansOlsson pointed out in his answer, doing case-insensitive comparisons between only letters and allowing all other comparisons to have their "natural" results as done in strcmp() would break sorting.

If 'A' == 'a' (the definition of a case-insensitive comparison) then '_' > 'A' and '_' < 'a' (the "natural" results in the ASCII character set) can not both be true.

like image 185
Andrew Henle Avatar answered Oct 23 '22 23:10

Andrew Henle


Other links, http://man7.org/linux/man-pages/man3/strcasecmp.3p.html for strcasecmp say that converting to lower-case is the correct behavior (at least in POSIX locale).

The reason for that behavior is that if you use strcasecmp to sort an array of strings it is needed to get reasonable results.

Otherwise if you try to sort "A", "C", "_", "b" using e.g., qsort the result would depend on the order of comparisons.

like image 21
Hans Olsson Avatar answered Oct 23 '22 23:10

Hans Olsson


It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.

That's correct - and it's what the strcasecmp() function should do! It is a POSIX function, rather than part of the C Standard but, from the "The Open Group Base Specifications, Issue 6":

In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. The results are unspecified in other locales.

Incidentally, this behaviour is also pertintent to the _stricmp() function (as used in Visual Studio/MSCV):

The _stricmp function ordinally compares string1 and string2 after converting each character to lowercase, and returns a value indicating their relationship.

like image 8
Adrian Mole Avatar answered Oct 23 '22 23:10

Adrian Mole