When returning the difference between pointers of char strings, how important is the order of casting and dereferencing?

For educational purposes (yes, 42, yes) I'm rewriting strncmp, and a classmate just came up to me asking why I was casting my return values the way I do. My suggestion was to typecast first and dereference afterwards. My logic was that I wanted to treat the char string as an unsigned char string and dereference it as such.

int strncmp(const char *s1, const char *s2, size_t n)
{
    if (n == 0)
        return (0);
    while (*s1 == *s2 && *s1 && n > 1)
    {
        n--;
        s1++;
        s2++;
    }
    return (*(unsigned char *)s1 - *(unsigned char *)s2);
}

His suggestion was to dereference first and typecast afterwards, in order to make absolutely sure it returns the difference between two unsigned chars. Like this:

return ((unsigned char)*s1 - (unsigned char)*s2);

Following the discussion (and me agreeing with him that my casting is weird) we looked up some source code of production-ready implementations, and to our surprise Apple seems to cast/dereference in the same order as I do:

https://opensource.apple.com/source/Libc/Libc-167/gen.subproj/i386.subproj/strncmp.c.auto.html

Therefore the question: what is the difference in this case? And why choose one over the other?

(I've already found the following, but it deals with casting/dereferencing data types of different sizes, whereas for char/unsigned char it shouldn't matter, right?

In C, if I cast & dereference a pointer, does it matter which one I do first?)

asked Nov 18 '19 by wandawata


2 Answers

On a two's complement system (which is pretty much all of them), it won't make a difference.

The first example, *(unsigned char *)x, will simply interpret the binary value of the data stored at the location as an unsigned char. So if the decimal value stored at the location is -1, the hex value stored (assuming CHAR_BIT == 8) is 0xFF, and it will simply be interpreted as 255, since that bit pattern fits the unsigned char representation.

The second example (assuming char is signed on this compiler), (unsigned char)*x, will first grab the value stored at the location and then cast it to unsigned. So we get -1, and in casting it to unsigned char, the standard states that to translate a negative signed number to an unsigned value, you add one more than the maximum value storable by that type to the negative value, as many times as necessary, until you have a value within the type's range. So you get -1 + 256 = 255.
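For a concrete check, a minimal test along these lines (my own sketch, not from either implementation, assuming a two's complement system with CHAR_BIT == 8) shows both orderings yielding 255 for a byte holding -1:

#include <stdio.h>

int main(void)
{
    char c = -1;                 /* bit pattern 0xFF on a two's complement, 8-bit-char system */
    char *x = &c;

    int a = *(unsigned char *)x; /* reinterpret the byte as unsigned char, then dereference */
    int b = (unsigned char)*x;   /* dereference as char, then convert the value */

    printf("%d %d\n", a, b);     /* prints "255 255" on such a system */
    return 0;
}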

However, if you somehow were on a one's complement system, things go a bit differently.

Again, using *(unsigned char *)x, we reinterpret the hex representation of -1 as an unsigned char, but this time the hex representation is 0xFE, which will be interpreted as 254 rather than 255.

Going back to (unsigned char)*x, it will still just perform the -1 + 256 conversion to get the end result of 255.

All that said, I'm not sure whether the C standard allows a character encoding to use the 8th bit of a char. I know it's not used in ASCII-encoded strings, which is most likely what you will be working with, so you probably won't come across any negative values when comparing actual strings.


The rules for converting from signed to unsigned can be found in the C11 standard, section 6.3.1.3:

  1. When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.

  2. Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
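Paragraph 2 is what makes -1 come out as 255, and it also covers values that are further out of range. A tiny illustration of the rule (my own sketch, assuming UCHAR_MAX == 255):

#include <stdio.h>

int main(void)
{
    /* -1   + 256       -> 255
       -300 + 256 + 256 -> 212   ("repeatedly adding" UCHAR_MAX + 1) */
    printf("%d %d\n", (unsigned char)-1, (unsigned char)-300);
    return 0;
}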

answered Oct 08 '22 by Christian Gibbons


And why choose one over the other?

The code below makes a difference with non-2's-complement in an interesting way.

// #1
return (*(unsigned char *)s1 - *(unsigned char *)s2);
// #2
return ((unsigned char)*s1 - (unsigned char)*s2);

Non-2's-complement integer encoding (all but extinct these days) has a bit pattern that is either -0 or a trap representation.

If code used (unsigned char)*s1 when s1 pointed to such a byte, either the -0 would become a sign-less 0 or a trap could happen.

With -0 becoming an unsigned char 0, that would lose the arithmetic distinction from a null character, the character at the end of a string.
In C, a null character is a "byte with all bits set to 0".

To prevent that, *(unsigned char *)s1 is used.

C requires it:

7.24.1 String function conventions
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value). C17dr § 7.24.1.3

To that end, OP's code has a bug. With non-2's-complement, a *s1 holding -0 should not stop the loop.

// while (*s1 == *s2 && *s1 && n > 1)
while (*(unsigned char *)s1 == *(unsigned char *)s2 && *(unsigned char *)s1 && n > 1)

For the pedantic, a char may be the same size as an int. Some graphics processors have done this. In such cases, to prevent overflow, the following can be used. Works for the usual 8-bit char too.

// return (*(unsigned char *)s1 - *(unsigned char *)s2);
return (*(unsigned char *)s1 > *(unsigned char *)s2) - 
       (*(unsigned char *)s1 < *(unsigned char *)s2);
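As an illustration of that idiom (my own sketch, not part of the answer), the expression evaluates to -1, 0, or 1 and never forms a difference that could overflow:

#include <stdio.h>

/* hypothetical helper showing the (a > b) - (a < b) idiom */
static int cmp_uchar(unsigned char a, unsigned char b)
{
    return (a > b) - (a < b);
}

int main(void)
{
    printf("%d %d %d\n", cmp_uchar(200, 10), cmp_uchar(65, 65), cmp_uchar(0, 255));
    /* prints "1 0 -1" */
    return 0;
}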

Alternative

int strncmp(const char *s1, const char *s2, size_t n) {
  const unsigned char *u1 = (const unsigned char *) s1;
  const unsigned char *u2 = (const unsigned char *) s2;
  if (n == 0) {
      return (0);
  }
  while (*u1 == *u2 && *u1 && n > 1) {
      n--;
      u1++;
      u2++;
  }
  return (*u1 > *u2) - (*u1 < *u2);
}
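A quick usage sketch, assuming the alternative above sits in the same file (note that defining a function named strncmp collides with the standard library name; in practice, e.g. at 42, it would be ft_strncmp):

#include <stdio.h>

int main(void)
{
    /* with the (a > b) - (a < b) return, only -1, 0, or 1 comes back */
    printf("%d\n", strncmp("apple", "apples", 5));  /* 0: first 5 chars equal  */
    printf("%d\n", strncmp("apple", "apricot", 7)); /* -1: 'p' < 'r'           */
    printf("%d\n", strncmp("zeta", "alpha", 4));    /* 1: 'z' > 'a'            */
    return 0;
}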
answered Oct 09 '22 by chux - Reinstate Monica