
Regex to remove non alphanumeric characters from UTF8 strings

Tags: regex, php

How can I remove characters such as punctuation, commas, and dashes from a string in a multibyte-safe manner?

I will be working with input from many different languages, and I am wondering whether there is something that can help me with this.

Thanks

Thomas asked Dec 01 '11 20:12




1 Answer

There are Unicode character properties (classes) that you can use:

  • http://www.regular-expressions.info/unicode.html
  • http://php.net/manual/en/regexp.reference.unicode.php

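To match runs of non-letter characters you can just use \PL+ (\PL is the negation of \p{L}). To keep spaces, use a negated character class like [^\pL\s]+. Note that this also strips digits; add \pN, as in [^\pL\pN\s]+, if numbers should survive. Or just remove punctuation with \pP+.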

And don't forget the regex /u modifier, so that PCRE treats both the pattern and the input as UTF-8.
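
As a rough sketch of how those patterns could be used with preg_replace (the function names here are made up for illustration, and the first helper keeps numbers via \pN, which is my assumption about what "alphanumeric" should mean):

```php
<?php
// Hypothetical helpers; only preg_replace(), preg_last_error() and the
// \p{..} property classes are part of PHP/PCRE.

function strip_non_alphanumeric(string $text): string
{
    // Keep letters (\pL), numbers (\pN) and whitespace (\s); drop the rest.
    $clean = preg_replace('/[^\pL\pN\s]+/u', '', $text);
    if ($clean === null) {
        // With /u, preg_replace() returns null when the input is not valid UTF-8.
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }
    return $clean;
}

function strip_punctuation(string $text): string
{
    // Remove only characters in the Unicode "Punctuation" category.
    $clean = preg_replace('/\pP+/u', '', $text);
    if ($clean === null) {
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }
    return $clean;
}

echo strip_non_alphanumeric("Héllo, wörld! 123."), "\n";     // Héllo wörld 123
echo strip_punctuation("¿Qué tal? Bien, gracias."), "\n";    // Qué tal Bien gracias
```

Keep in mind that \pP covers punctuation only; symbols such as € or + belong to \pS, so they survive strip_punctuation() but are removed by the negated class.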

mario answered Sep 19 '22 15:09
