How can I remove characters such as punctuation, commas, and dashes from a string in a multibyte-safe manner?
I will be working with input in many different languages, and I am wondering if there is something that can help me with this.
Thanks
To remove all non-alphanumeric characters from a string, call the replace() method, passing it a regular expression that matches all non-alphanumeric characters as the first parameter and an empty string as the second. The replace method returns a new string with all matches replaced.
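In Java, the character class \p{Alnum} matches any alphanumeric character, but by default only ASCII ones; for input in other scripts, compile the pattern with the Pattern.UNICODE_CHARACTER_CLASS flag or use Unicode categories such as \p{L} and \p{N}.
The approach is to use the String.replaceAll method to replace all the non-alphanumeric characters with an empty string.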
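As a rough illustration of the replace-with-empty-string approach in JavaScript, here is a minimal sketch. The function name stripNonAlphanumeric is made up for this example, and keeping combining marks (\p{M}) is an assumption on my part so that decomposed accented letters are not broken apart.

```javascript
// Hypothetical helper, not from the original answers.
function stripNonAlphanumeric(input) {
  // \p{L} = letters, \p{N} = numbers, \p{M} = combining marks, in any script.
  // The "u" flag enables \p{...} escapes; "g" replaces every match, not just the first.
  return input.replace(/[^\p{L}\p{N}\p{M}]+/gu, "");
}

console.log(stripNonAlphanumeric("abc-123, def!"));    // "abc123def"
console.log(stripNonAlphanumeric("日本語、テスト。"));   // "日本語テスト"
```

Note that spaces are removed too, since they are neither letters nor numbers; add \s inside the brackets if you want to keep them.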
There are Unicode character classes that you can use:
To match any non-letter symbols you can just use \PL+, the negation of \p{L}. To not remove spaces, use a charclass like [^\pL\s]+. Or really just remove punctuation with \pP+
Well, and obviously don't forget the regex /u modifier.