I need to filter punctuation from UTF-8 strings quickly in C. The strings could be long and they are quite numerous. The function I'm using currently seems very inefficient:
char *filter(char *mystring){
    char *p;
    while ((p = strchr(mystring,'.')) != NULL)
        strcpy(p, p+1);
    while ((p = strchr(mystring,',')) != NULL)
        ...etc etc etc...
    ...etc...
    return mystring;
}
As you can see it iterates through the string for each punctuation mark. Is there a simple library function that can complete this efficiently for all punctuation marks?
A more efficient algorithm is:
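#include <ctype.h>

char *filter(char *mystring)
{
    char *in = mystring;
    char *out = mystring;

    /* Single pass: copy each non-punctuation byte down to the write
       position. The loop also copies the terminating NUL before stopping. */
    do {
        /* Cast needed: passing a negative char to ispunct() is undefined. */
        if (!ispunct((unsigned char)*in))
            *out++ = *in;
    } while (*in++);

    return mystring;
}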
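It isn't UTF-8 specific, though: ispunct() classifies one byte at a time according to the current locale. In the default "C" locale that typically means only ASCII punctuation is removed. This is safe for UTF-8 input, because the bytes of a multi-byte sequence are never in the ASCII range, but non-ASCII punctuation such as curly quotes is left in place. (Your original wasn't UTF-8 specific either.)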
If you wish to make it UTF-8, you could replace ispunct() with a function that will take a char * and determine if it starts with a (potentially multi-byte) UTF-8 character that's some kind of punctuation mark (and call it with in instead of *in).
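A minimal sketch of that idea follows. It is an illustration under stated assumptions, not a complete solution: utf8_decode(), is_punct_cp() and filter_utf8() are hypothetical names, the decoder does not reject overlong or otherwise invalid encodings, and the punctuation table covers only the ASCII ispunct() set plus two small Unicode ranges chosen as examples. A real implementation would want a full Unicode property lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
   Stores the code point in *cp and returns the bytes consumed (1 to 4).
   Malformed or truncated input is consumed one byte at a time and
   reported as U+FFFD, so it is never classified as punctuation.
   Simplification: overlong encodings are not rejected. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;

    if (s[0] < 0x80)                { *cp = s[0];        return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else                            { *cp = 0xFFFD;      return 1; }

    for (i = 1; i < len; i++) {
        /* A non-continuation byte (including the NUL terminator)
           ends the sequence early, so we never read past the string. */
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical classifier. The table is an assumption made for this
   example and is far from the full Unicode punctuation category. */
static int is_punct_cp(uint32_t cp)
{
    if (cp < 0x80)  /* same set as ispunct() in the "C" locale */
        return (cp >= 0x21 && cp <= 0x2F) || (cp >= 0x3A && cp <= 0x40) ||
               (cp >= 0x5B && cp <= 0x60) || (cp >= 0x7B && cp <= 0x7E);
    return (cp >= 0x2010 && cp <= 0x2027) ||  /* dashes, curly quotes, bullet, ellipsis */
           (cp >= 0x3001 && cp <= 0x3003);    /* CJK comma, full stop, ditto mark */
}

/* Same single-pass, in-place compaction as filter(), but stepping
   one whole UTF-8 character at a time instead of one byte. */
char *filter_utf8(char *mystring)
{
    const unsigned char *in = (const unsigned char *)mystring;
    unsigned char *out = (unsigned char *)mystring;

    while (*in) {
        uint32_t cp;
        size_t i, len = utf8_decode(in, &cp);

        if (!is_punct_cp(cp)) {
            for (i = 0; i < len; i++)   /* out never passes in, so this is safe */
                out[i] = in[i];
            out += len;
        }
        in += len;
    }
    *out = '\0';
    return mystring;
}
```

Called on a writable buffer such as char s[] = "Hello, world!", filter_utf8(s) leaves "Hello world" in s. Multi-byte characters that are not in the table, such as accented letters, are copied through unchanged.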
The ICU libraries have C bindings, and include a regex library that correctly handles Unicode \pP punctuation.