
PHP Regular expression - Remove all non-alphanumeric characters

I use PHP.

My string can look like this:

This is a string-test width åäö and some über+strange characters: _like this?

Question

Is there a way to remove non-alphanumeric characters and replace them with a space? Here are some non-alphanumeric characters:

  • -
  • +
  • :
  • _
  • ?

I've read many threads about this, but the solutions there don't support other languages, like this one:

preg_replace("/[^A-Za-z0-9 ]/", '', $string);

Requirements

  • My list of non-letter characters might not be complete.
  • My content contains characters from different languages, like åäöü. There could be many more.
  • The non-alphanumeric characters should be replaced with a space. Otherwise the words would be glued to each other.

Jens Törnell asked May 07 '13 19:05



1 Answer

You can try this:

preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);

\p{L} matches any letter, whatever the alphabet.

\p{N} matches any kind of numeric character.

With the u modifier, the pattern and the subject string are treated as UTF-8 (Unicode) strings rather than as sequences of single bytes.
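As a quick check, here is a minimal sketch of that pattern applied to the sample string from the question. The function name replaceNonAlnum is only illustrative, the file is assumed to be saved as UTF-8, and the trim() call is an addition of this sketch to drop the space left by the trailing "?":

```php
<?php
// Hypothetical helper: the name is illustrative and not part of the answer.
function replaceNonAlnum(string $string): string
{
    // Every run of characters that are neither letters (\p{L}) nor
    // numbers (\p{N}) is replaced with a single space.
    $result = preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);

    // preg_replace() returns null on error, for example on invalid UTF-8 input.
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }

    // Assumption: leading and trailing spaces are not wanted.
    return trim($result);
}

echo replaceNonAlnum('This is a string-test width åäö and some über+strange characters: _like this?'), PHP_EOL;
// This is a string test width åäö and some über strange characters like this
```

Note that ": _" collapses into one space because the space itself is part of the matched run.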

Or this:

preg_replace('~\P{Xan}++~u', ' ', $string);

\p{Xan} matches Unicode letters and digits.

\P{Xan} matches everything that is not a Unicode letter or digit. (Be careful: that includes whitespace, which you can preserve by using ~[^\p{Xan}\s]++~u instead.)
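To make the whitespace remark concrete, here is a small sketch comparing the two patterns on a short made-up input (the input string and variable names are only illustrative):

```php
<?php
$string = 'a: _b';

// \P{Xan} also matches the space, so ": _" is one run and becomes one space.
$collapsed = preg_replace('~\P{Xan}++~u', ' ', $string);

// With \s excluded from the match, the original space stays where it is;
// ":" and "_" are replaced separately, which gives three spaces in total.
$preserved = preg_replace('~[^\p{Xan}\s]++~u', ' ', $string);

var_dump($collapsed); // string(3) "a b"
var_dump($preserved); // string(5) "a   b"
```

So the second form keeps existing whitespace (including tabs and newlines) untouched, at the cost of possibly leaving several spaces in a row.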

If you want a more specific set of allowed letters, replace \p{L} with ranges from the Unicode table.

Example:

preg_replace('~[^a-zÀ-ÖØ-öÿŸ\d]++~ui', ' ', $string);
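A short sketch of how this restricted pattern behaves, using a made-up input (not from the question) and assuming the file is saved as UTF-8:

```php
<?php
// Illustrative input: the CJK characters fall outside the listed ranges.
$string = 'café 日本語 42';

// Only a-z (either case, because of the i modifier), the ranges À-Ö and Ø-ö,
// the characters ÿ and Ÿ, and digits are kept.
// The run " 日本語 " contains no allowed character, so it becomes one space.
$result = preg_replace('~[^a-zÀ-ÖØ-öÿŸ\d]++~ui', ' ', $string);

echo $result, PHP_EOL; // café 42
```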

Why use a possessive quantifier (++) here?

~\P{Xan}+~u gives the same result as ~\P{Xan}++~u. The difference is that with the first pattern the engine records each backtracking position (which we don't need), whereas with the second it doesn't (as in an atomic group). The result is a small performance gain.
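The following sketch illustrates both points: the two quantifiers give the same replacement here, and they only differ when backtracking would be required. The a+a / a++a patterns are an extra illustration and are not part of the original patterns:

```php
<?php
$string = 'some über+strange characters: _like this?';

// Greedy and possessive versions produce the same replacement here,
// because nothing follows the quantifier that could force backtracking.
$greedy     = preg_replace('~\P{Xan}+~u', ' ', $string);
$possessive = preg_replace('~\P{Xan}++~u', ' ', $string);
var_dump($greedy === $possessive); // bool(true)

// The difference only shows when backtracking is needed:
// a+ gives back one "a" so that the final "a" can match,
// while a++ never gives anything back, so the match fails.
$greedyMatch     = preg_match('~a+a~', 'aaa');
$possessiveMatch = preg_match('~a++a~', 'aaa');
var_dump($greedyMatch);     // int(1)
var_dump($possessiveMatch); // int(0)
```

This does not measure the performance difference; it only shows the matching behaviour.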

I think it's good practice to use possessive quantifiers and atomic groups whenever possible.

However, the PCRE regex engine automatically makes a quantifier possessive in obvious situations (example: a+b => a++b), unless the pattern is compiled with the PCRE_NO_AUTO_POSSESS option. (http://www.pcre.org/pcre.txt)

More information about possessive quantifiers and atomic groups: here (possessive quantifiers), here (atomic groups), or here.

Casimir et Hippolyte answered Oct 30 '22 10:10