I am removing control characters (tab, CR, LF, \v and all other invisible characters) on the client side (after input), but since the client cannot be trusted, I have to remove them on the server too.
According to http://www.utf8-chartable.de/, the control characters run from x00 to x1F and from x7F to x9F. Thus my client-side (JavaScript) control character removal function is:
return s.replace(/[\x00-\x1F\x7F-\x9F]/g, "");
and my PHP (server-side) control character removal function is:
$s = preg_replace('/[\x00-\x1F\x7F-\x9F]/', '', $s);
This seems to cause problems with international UTF-8 characters such as ς (xCF x82), but only in PHP (because x82 falls inside the second range); the JavaScript equivalent does not cause any problems.
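To make the difference concrete, here is a minimal JavaScript sketch that compares character-wise removal with byte-wise removal on the same input. The helper names are made up for illustration, and the byte-wise function only approximates what a byte-oriented regex does to a UTF-8 string; it is not PHP's actual implementation.

```javascript
// Hypothetical helpers for illustration only.
const encoder = new TextEncoder();
const decoder = new TextDecoder("utf-8");

// Character-wise removal: the regex sees UTF-16 code units, so ς (U+03C2) never matches.
function stripControlChars(s) {
  return s.replace(/[\x00-\x1F\x7F-\x9F]/g, "");
}

// Byte-wise removal: drops every UTF-8 byte in 00-1F or 7F-9F,
// including continuation bytes that belong to multi-byte characters.
function stripControlBytes(s) {
  const bytes = encoder.encode(s);
  const kept = bytes.filter((b) => !(b <= 0x1f || (b >= 0x7f && b <= 0x9f)));
  return decoder.decode(kept);
}

const sigma = "ς"; // U+03C2
const sigmaBytes = Array.from(encoder.encode(sigma), (b) => b.toString(16));
console.log(sigmaBytes); // [ 'cf', '82' ]

const charWise = stripControlChars("a\tς");
const byteWise = stripControlBytes("a\tς");
console.log(charWise); // "aς"
console.log(byteWise); // "a" followed by U+FFFD, because the lone CF byte is no longer valid UTF-8
```

The tab is removed in both cases, but only the byte-wise version also removes the 82 byte and leaves a broken sequence behind.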
Now my question is: should I remove the control characters from 7F to 9F at all? To my understanding, byte values from 127 to 159 (7F to 9F) can obviously be part of a valid UTF-8 string?
Also, maybe I shouldn't even filter the 00 to 31 control characters, because some of those values might also appear inside some unusual (Japanese? Chinese?) but valid UTF-8 characters?
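On the second worry: in UTF-8, byte values 00 to 7F only ever encode the ASCII characters themselves, and every byte of a multi-byte sequence is 80 or higher. So the 00 to 1F range (and 7F) cannot collide with Japanese, Chinese or any other non-ASCII characters; only the 80 to 9F part can. A small JavaScript sketch to check this on a sample string (the function name and the sample text are arbitrary choices for illustration):

```javascript
// Hypothetical helper: reports whether any non-ASCII character in s
// contributes a UTF-8 byte below 0x80.
function hasLowByteInMultibyte(s) {
  const encoder = new TextEncoder();
  for (const ch of s) { // iterates by code point, not by UTF-16 code unit
    const bytes = encoder.encode(ch);
    if (bytes.length > 1 && bytes.some((b) => b < 0x80)) return true;
  }
  return false;
}

const sample = "ς 日本語 中文 한국어 😀";
const lowByteFound = hasLowByteInMultibyte(sample);
console.log(lowByteFound); // false

// 日 (U+65E5) encodes as E6 97 A5; note that 97 lies inside 80-9F.
const kanjiBytes = Array.from(new TextEncoder().encode("日"), (b) => b.toString(16));
console.log(kanjiBytes); // [ 'e6', '97', 'a5' ]
```

The second output shows that ς is not a special case: a byte-wise filter on 80 to 9F would damage 日 as well.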
$string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string);
This matches any byte in the ranges 0-31 and 127-255 and removes it. Note that, without the u flag, this also strips every byte of every non-ASCII character.
Control characters are e.g. line feed, tab, escape.
Explanation: In PHP, ltrim can remove characters from the beginning of a string, but you have to specify which characters to remove, i.e. the characters to strip must be known in advance.
$str = "geeks";
// Or we can write ltrim($str, $str[0]);
function clean($string) {
    $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
    return preg_replace('/[^A-Za-z\-]/', '', $string); // Removes special chars.
}
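It seems that I just need to add the u flag to the regex, so that the pattern and the subject are treated as UTF-8 and the ranges match code points rather than individual bytes. It becomes: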
$s = preg_replace('/[\x00-\x1F\x7F-\x9F]/u', '', $s);
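For the client side, a possible alternative to the explicit ranges is the Unicode general category Cc, which covers exactly U+0000 to U+001F and U+007F to U+009F. The sketch below assumes a JavaScript engine that supports Unicode property escapes with the u flag; the function name is made up. PCRE also understands \p{Cc}, so the same idea should carry over to the PHP pattern when the u flag is set, though I have not tested that here.

```javascript
// Hypothetical helper: removes every character in general category Cc.
function stripControlCategory(s) {
  return s.replace(/\p{Cc}/gu, "");
}

const cleaned = stripControlCategory("a\u0000b\u0085c\u009Fς\n");
console.log(cleaned); // "abcς"

// Sanity check: for U+0000..U+00FF, \p{Cc} agrees with the explicit ranges.
let rangesAgree = true;
for (let cp = 0; cp <= 0xff; cp++) {
  const ch = String.fromCodePoint(cp);
  const viaRange = /[\x00-\x1F\x7F-\x9F]/.test(ch);
  const viaCategory = /\p{Cc}/u.test(ch);
  if (viaRange !== viaCategory) rangesAgree = false;
}
console.log(rangesAgree); // true
```

Either form works; the category version mainly saves you from retyping the hex ranges in two places.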