
Fastest way to filter punctuation in C

Tags:

c

filtering

I need to filter punctuation from UTF-8 strings quickly in C. The strings could be long and they are quite numerous. The function I'm using currently seems very inefficient:

#include <string.h>

char *filter(char *mystring){
    char *p;
    while ((p = strchr(mystring,'.')) != NULL)
        memmove(p, p+1, strlen(p)); /* strcpy() on overlapping buffers is undefined */
    while ((p = strchr(mystring,',')) != NULL)
        ...etc etc etc...
    ...etc...
    return mystring;
}

As you can see, it iterates through the whole string once for each punctuation mark. Is there a simple library function that can do this efficiently for all punctuation marks at once?

KeatsKelleher asked Dec 29 '22 05:12


2 Answers

A more efficient algorithm is:

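#include <ctype.h>

char *filter(char *mystring)
{
    char *in = mystring;   /* read position */
    char *out = mystring;  /* write position; never ahead of in */

    /* Single pass: copy each non-punctuation byte down over the gaps.
       The do/while also copies the terminating '\0'. */
    do {
        /* Cast first: passing a negative char to ispunct() is undefined. */
        if (!ispunct((unsigned char)*in))
            *out++ = *in;
    } while (*in++);

    return mystring;
}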

It isn't UTF-8 specific, though: ispunct() classifies single bytes according to the current locale, so in the default "C" locale it only removes ASCII punctuation. (Your original wasn't UTF-8 specific, either.)

If you wish to make it UTF-8, you could replace ispunct() with a function that will take a char * and determine if it starts with a (potentially multi-byte) UTF-8 character that's some kind of punctuation mark (and call it with in instead of *in).
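As a rough sketch of that idea (not a complete solution), the version below decodes one UTF-8 sequence at a time and skips it if its code point counts as punctuation. The names utf8_decode(), is_punct_cp() and filter_utf8() are made up for this illustration, and the table in is_punct_cp() is only a small hand-picked subset of Unicode punctuation (ASCII, a few Latin-1 marks, U+2010..U+2027 and U+3001..U+3003), not the full \pP category. Invalid or truncated sequences are assumed to be copied through byte by byte.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: decode the UTF-8 sequence starting at s into *cp and
   return its length in bytes. Invalid or truncated input yields U+FFFD and a
   length of 1, so the caller always makes progress and never reads past the
   terminating '\0'. */
static size_t utf8_decode(const char *s, unsigned long *cp)
{
    const unsigned char *u = (const unsigned char *)s;
    size_t len, i;

    if (u[0] < 0x80)                { *cp = u[0]; return 1; }
    else if ((u[0] & 0xE0) == 0xC0) { len = 2; *cp = u[0] & 0x1F; }
    else if ((u[0] & 0xF0) == 0xE0) { len = 3; *cp = u[0] & 0x0F; }
    else if ((u[0] & 0xF8) == 0xF0) { len = 4; *cp = u[0] & 0x07; }
    else                            { *cp = 0xFFFD; return 1; }

    for (i = 1; i < len; i++) {
        if ((u[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (u[i] & 0x3F);
    }
    return len;
}

/* Hypothetical classifier: an illustrative subset only. A real program
   would consult the Unicode character database (for example through ICU). */
static int is_punct_cp(unsigned long cp)
{
    if (cp < 0x80)
        return ispunct((int)cp) != 0;
    return cp == 0x00A1 || cp == 0x00AB || cp == 0x00BB || cp == 0x00BF
        || (cp >= 0x2010 && cp <= 0x2027)   /* dashes, quotes, bullet, ellipsis */
        || (cp >= 0x3001 && cp <= 0x3003);  /* CJK comma, full stop, ditto mark */
}

/* Same in/out pointer scheme as filter(), but stepping by whole characters. */
char *filter_utf8(char *mystring)
{
    char *in = mystring;
    char *out = mystring;

    while (*in) {
        unsigned long cp;
        size_t len = utf8_decode(in, &cp);

        if (is_punct_cp(cp)) {
            in += len;              /* drop every byte of the character */
        } else {
            while (len--)           /* keep the character intact */
                *out++ = *in++;
        }
    }
    *out = '\0';
    return mystring;
}
```

It still makes a single pass over the string and works in place, so the buffer must be writable. Calling filter_utf8() on a char array holding "Hello, world!" leaves "Hello world" in it.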

caf answered Jan 09 '23 03:01


The ICU libraries have C bindings, and include a regex library that correctly handles Unicode \pP punctuation.

tchrist answered Jan 09 '23 03:01