I am trying to reimplement the strcasecmp
function in C and I noticed what appears to be an inconsistency in the comparison process.
From man strcmp
The strcmp() function compares the two strings s1 and s2. The locale is not taken into account (for a locale-aware comparison, see strcoll(3)). It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.
From man strcasecmp
The strcasecmp() function performs a byte-by-byte comparison of the strings s1 and s2, ignoring the case of the characters. It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.
int strcmp(const char *s1, const char *s2);
int strcasecmp(const char *s1, const char *s2);
Given, this information, I don't understand the result of the following code:
#include <stdio.h>
#include <string.h>
int main()
{
// ASCII values
// 'A' = 65
// '_' = 95
// 'a' = 97
printf("%i\n", strcmp("A", "_"));
printf("%i\n", strcmp("a", "_"));
printf("%i\n", strcasecmp("A", "_"));
printf("%i\n", strcasecmp("a", "_"));
return 0;
}
Ouput:
-1 # "A" is less than "_"
1 # "a" is more than "_"
2 # "A" is more than "_" with strcasecmp ???
2 # "a" is more than "_" with strcasecmp
It appears that, if the current character in s1
is a letter, it is always converted to lowercase, regardless of whether the current character in s2
is a letter or not.
Can someone explain this behaviour? Shouldn't the first and third lines be identical?
Thank you in advance!
PS:
I am using gcc 9.2.0
on Manjaro.
Also, when I compile with the -fno-builtin
flag I get instead:
-30
2
2
2
I guess it's because the program does not use gcc's optimised functions, but the question remains.
The strcasecmp() function compares string1 and string2 without sensitivity to case. All alphabetic characters in string1 and string2 are converted to lowercase before comparison.
It depends on strcasecmp() specifications (it is not C standard) . Most strcasecmp() do use this tolower() approach for the default locale.
The strcasecmp() function compares two strings. Tip: The strcasecmp() function is binary-safe and case-insensitive. Tip: This function is similar to the strncasecmp() function, with the difference that you can specify the number of characters from each string to be used in the comparison with strncasecmp().
In programming language C, strcasecmp is a function declared in the strings. h header file (or sometimes in string. h) that compares two strings irrespective of the case of characters.
The behavior is correct.
Per the POSIX str\[n\]casecmp()
specification:
When the
LC_CTYPE
category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.
That is also part of the NOTES section of the Linux man page:
The POSIX.1-2008 standard says of these functions:
When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.
Why?
As @HansOlsson pointed out in his answer, doing case-insensitive comparisons between only letters and allowing all other comparisons to have their "natural" results as done in strcmp()
would break sorting.
If 'A' == 'a'
(the definition of a case-insensitive comparison) then '_' > 'A'
and '_' < 'a'
(the "natural" results in the ASCII character set) can not both be true.
Other links, http://man7.org/linux/man-pages/man3/strcasecmp.3p.html for strcasecmp say that converting to lower-case is the correct behavior (at least in POSIX locale).
The reason for that behavior is that if you use strcasecmp to sort an array of strings it is needed to get reasonable results.
Otherwise if you try to sort "A", "C", "_", "b" using e.g., qsort the result would depend on the order of comparisons.
It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.
That's correct - and it's what the strcasecmp()
function should do! It is a POSIX
function, rather than part of the C
Standard but, from the "The Open Group Base Specifications, Issue 6":
In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. The results are unspecified in other locales.
Incidentally, this behaviour is also pertintent to the _stricmp()
function (as used in Visual Studio/MSCV):
The _stricmp function ordinally compares string1 and string2 after converting each character to lowercase, and returns a value indicating their relationship.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With