I'm using the following regex to strip out non-printing control characters from user input before inserting the values into the database.
preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $value)
Is there a problem with using this on UTF-8 strings? It seems to remove all non-ASCII characters entirely.
Using str_replace(): The str_replace() function can remove special characters from a given string by replacing each of them with a space (" ").
The preg_replace() function returns a string or array of strings where all matches of a pattern or list of patterns found in the input are replaced with substrings.
The ltrim() function removes whitespace or other predefined characters from the left side of a string. Related functions: rtrim() – Removes whitespace or other predefined characters from the right side of a string. trim() – Removes whitespace or other predefined characters from both sides of a string.
Part of the problem is that you aren't treating the target as a UTF-8 string; you need the /u modifier for that. Also, in UTF-8 any non-ASCII character is represented by two or more bytes, all of them in the range \x80..\xFF, so your byte-oriented character class strips every one of them. Try this:
preg_replace('/\p{Cc}+/u', '', $value)
\p{Cc} is the Unicode property for control characters, and the u modifier causes both the regex and the target string to be treated as UTF-8.
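To see the difference side by side, here is a minimal sketch comparing the pattern from the question with the \p{Cc} version. The helper name strip_control_chars and the sample string are made up for illustration. Keep in mind that \p{Cc} also covers tab, line feed and carriage return, so those are removed too.

```php
<?php
// Hypothetical helper: remove Unicode control characters (category Cc).
// Note: Cc includes \t, \n and \r, so those are stripped as well.
function strip_control_chars(string $value): string
{
    // With /u, preg_replace() returns null when $value is not valid UTF-8;
    // falling back to '' here is just one possible choice.
    $clean = preg_replace('/\p{Cc}+/u', '', $value);
    return $clean ?? '';
}

// "café", a NUL byte, " naïve", then a unit-separator control character.
$input = "caf\u{00E9}\x00 na\u{00EF}ve\x1F";

// Byte-oriented pattern from the question: every byte of each
// multi-byte character falls in \x80-\xFF and is removed.
$bytewise = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $input);

// Unicode-aware pattern: only the control characters are removed.
$unicode = strip_control_chars($input);

var_dump($bytewise === 'caf nave');
var_dump($unicode === "caf\u{00E9} na\u{00EF}ve");
```

If you need to keep line breaks or tabs, you would have to exclude them from the pattern explicitly.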
You can use Unicode character properties:
preg_replace('/[^\p{L}\s]/u','',$value);
(Do add the other classes you want to let through)
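For instance, a whitelist that also lets digits and punctuation through might look like the sketch below. Which classes to allow is purely an assumption here; pick the ones that fit your data.

```php
<?php
// Keep letters (L), numbers (N), punctuation (P) and whitespace;
// drop everything else. The set of classes is only an example.
$value = "Jos\u{00E9} paid 20\u{20AC}, ok?\x07";

// The euro sign (category Sc) and the bell character (category Cc)
// are not in the allowed set, so both are removed.
$clean = preg_replace('/[^\p{L}\p{N}\p{P}\s]/u', '', $value);

var_dump($clean === "Jos\u{00E9} paid 20, ok?");
```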
If you want to convert Unicode to ASCII, this is by no means foolproof, but it gives some nice transliterations:
echo iconv('utf-8','ascii//translit','éñó'); // may print 'eno'; the result depends on the iconv implementation and the current locale