I use PHP.
My string can look like this
This is a string-test width åäö and some über+strange characters: _like this?
Question
Is there a way to remove non-alphanumeric characters and replace them with a space? Here are some non-alphanumeric characters:
I've read many threads about this, but the solutions don't support letters from other languages, like this one:
preg_replace("/[^A-Za-z0-9 ]/", '', $string);
Requirements
In JavaScript, to remove all non-alphanumeric characters from a string, call the replace() method, passing it a regular expression that matches all non-alphanumeric characters as the first parameter and an empty string as the second. The replace() method returns a new string with all matches replaced.
One useful function for removing special characters from a string is str_replace(). To remove a specified character, pass an empty string as the replacement. The str_replace() function can take four arguments.
PHP's ctype_alnum() function checks whether all characters of a given string are alphanumeric. It returns TRUE if every character is alphanumeric and FALSE otherwise. Its single mandatory parameter, $text, specifies the string to check.
Non-alphanumeric characters are characters that are not numbers (0-9) or alphabetic characters.
You can try this:
preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);
\p{L} stands for all alphabetic characters (whatever the alphabet).
\p{N} stands for numbers.
With the u modifier, the pattern and the subject string are treated as UTF-8, so multibyte characters are matched as whole characters rather than as separate bytes.
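As a minimal sketch of the pattern above applied to the sample string from the question: the helper name replaceNonAlnum() is hypothetical and only used for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper name; not part of PHP or of the answer above.
function replaceNonAlnum(string $text): string
{
    // \p{L} = any letter, \p{N} = any number; the u modifier enables UTF-8 handling.
    $result = preg_replace('~[^\p{L}\p{N}]++~u', ' ', $text);

    // preg_replace() returns null on failure (for example, invalid UTF-8 input),
    // in which case this sketch simply falls back to the original text.
    return $result ?? $text;
}

$input = 'This is a string-test width åäö and some über+strange characters: _like this?';
echo replaceNonAlnum($input), PHP_EOL;
```

Each run of non-alphanumeric characters collapses into a single space, so ": _" becomes one space and the final "?" leaves a trailing space. Wrapping the call in trim() is one way to drop it if that matters.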
Or this:
preg_replace('~\P{Xan}++~u', ' ', $string);
\p{Xan} matches Unicode letters and digits.
\P{Xan} matches everything that is not a Unicode letter or digit. (Be careful: that includes whitespace, which you can preserve with ~[^\p{Xan}\s]++~u)
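To make the whitespace caveat concrete, here is a small sketch comparing the two patterns on a made-up input containing a tab. The variable names are illustrative only.

```php
<?php
// Made-up input: a tab between "one" and "two", then a comma and a space.
$input = "one\ttwo, three";

// \P{Xan} also matches whitespace, so the tab is replaced by a space
// and ", " collapses into a single space.
$all = preg_replace('~\P{Xan}++~u', ' ', $input);

// Excluding \s from the match keeps the tab and the original space;
// only the comma is replaced.
$keep = preg_replace('~[^\p{Xan}\s]++~u', ' ', $input);

var_dump($all);  // "one two three"
var_dump($keep); // "one<TAB>two  three" (tab kept, two spaces before "three")
```

Note that the second variant can leave doubled spaces, because the space that replaces the comma sits next to the original one.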
If you want a more specific set of allowed letters, you must replace \p{L} with ranges from the Unicode table.
Example:
preg_replace('~[^a-zÀ-ÖØ-öÿŸ\d]++~ui', ' ', $string);
Why use a possessive quantifier (++) here?
~\P{Xan}+~u will give you the same result as ~\P{Xan}++~u. The difference is that with the first pattern the engine records each backtracking position (which we don't need), whereas with the second it doesn't (as in an atomic group). The result is a small performance gain.
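A quick way to convince yourself that both forms produce the same output is the sketch below, which uses a shortened version of the sample string. It only checks equality of the results and does not measure the performance difference.

```php
<?php
$input = 'über+strange characters: _like this?';

// Greedy quantifier: the engine may record backtracking positions.
$greedy = preg_replace('~\P{Xan}+~u', ' ', $input);

// Possessive quantifier: matched characters are never given back.
$possessive = preg_replace('~\P{Xan}++~u', ' ', $input);

var_dump($greedy === $possessive); // bool(true)
```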
I think it's good practice to use possessive quantifiers and atomic groups wherever possible.
However, the PCRE regex engine automatically makes a quantifier possessive in obvious situations (example: a+b => a++b), unless this optimization has been disabled with the PCRE_NO_AUTO_POSSESS option. (http://www.pcre.org/pcre.txt)
More information about possessive quantifiers and atomic groups here (possessive quantifiers) and here (atomic groups) or here.