Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove all non alphabet characters from a string in C -- possible compiler issue

I'm writing a function in C that will take a string and remove all characters that are not a lowercase alphabetic character. I have this code written so far:

void strclean(char* str) {
   while (*str) {
      if (!(*str >= 'a' && *str <= 'z')) {
         strcpy(str, str + 1);
         str--;
      }
      str++;
   }
}

When I pass it the string "hello[][]world" the function seems to mostly work, except the output is:

hellowoldd

When I make it print after every line that it enters the if statement, here is the output I receive:

hello][]woldd
hello[]woldd
hello]woldd
hellowoldd

It seems to be really close but I can't understand why it would be producing this output! The weirdest part is I have given the code to two other friends and it works fine on their computers. We are all running the same version of Linux (ubuntu 14.04.3), and are all using gcc to compile.

I'm not sure if there is an issue with the code that would cause inconsistency of the output, or if it's a compiler issue that's creating the problem. Maybe it has to do with strcpy on my machine compared to theirs?

like image 630
Chris Avatar asked Nov 16 '25 05:11

Chris


2 Answers

The strcpy function is not guaranteed to work if the ranges overlap, as they do in your case. From C11 7.24.2.3 The strcpy function /2 (my emphasis):

The strcpy function copies the string pointed to by s2 (including the terminating null character) into the array pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.

You can use something like memmove, which does work with overlapping ranges, as per C11 7.24.2.2 The memmove function /2:

The memmove function copies n characters from the object pointed to by s2 into the object pointed to by s1. Copying takes place as if the n characters from the object pointed to by s2 are first copied into a temporary array of n characters that does not overlap the objects pointed to by s1 and s2, and then the n characters from the temporary array are copied into the object pointed to by s1.


But there's a better solution that's O(n) rather than O(n2) in time complexity, while still being overlap-safe:

void strclean (char* src) {
    // Run two pointers in parallel.

    char *dst = src;

    // Process every source character.

    while (*src) {
        // Only copy (and update destination pointer) if suitable.
        // Update source pointer always.

        if (islower(*src)) *dst++ = *src;
        src++;
    }

    // Finalise destination string.

    *dst = '\0';
}

You'll notice I'm also using islower() (from ctype.h) to detect lower case alphabetic characters. This is more portable since the C standard does not mandate that the alpha characters have consecutive code points (the digits are the only ones guaranteed to be consecutive).

There's also no separate need to check for isalpha() since, as per C11 7.4.1.2 The isalpha function /2, islower() == true implies isalpha() == true:

The isalpha function tests for any character for which isupper or islower is true, or ...

like image 136
paxdiablo Avatar answered Nov 18 '25 19:11

paxdiablo


from N1256 7.21.2.3 The strcpy function

If copying takes place between objects that overlap, the behavior is undefined.

memmove can be used even if the area is overlapped.

void strclean(char* str) {
   while (*str) {
      if (!islower(*str)) { /* include ctype.h to use islower function */
         memmove(str, str + 1, strlen(str)); /* strlen(str + 1) + 1 (for terminating null character) should be strlen(str) */
      } else {
         str++;
      }
   }
}

Since it's undefined behaviour to subtract from a pointer so that it points to an area before the array, I've also restructured the str manipulation.

like image 35
MikeCAT Avatar answered Nov 18 '25 19:11

MikeCAT



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!