preg_replace to strip out non-printing characters seems to remove all foreign characters as well

Tags: regex, php

I'm using the following regex to strip out non-printing control characters from user input before inserting the values into the database.

 preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $value)

Is there a problem with using this on UTF-8 strings? It seems to remove all non-ASCII characters entirely.

811

asked Jul 20 '10 23:07

Greg

2 Answers

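Part of the problem is that you aren't treating the target as a UTF-8 string; you need the /u modifier for that. Without it the regex matches byte by byte, and in UTF-8 every non-ASCII character is encoded as two or more bytes, all of them in the range \x80..\xFF, so the \x80-\xFF part of your character class deletes every byte of every non-ASCII character. Try this: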

preg_replace('/\p{Cc}+/u', '', $value)

\p{Cc} is the Unicode property for control characters, and the u causes both the regex and the target string to be treated as UTF-8.
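
To make the difference concrete, here is a minimal sketch that runs the pattern from the question and the \p{Cc} pattern on the same input. The sample string and the variable names are made up for illustration; the accented letters are written as explicit UTF-8 bytes so the example does not depend on the encoding of the source file.

```php
<?php
// Hypothetical sample input: "é" is the two bytes \xC3\xA9 in UTF-8,
// and \x00 and \x07 are control characters that should be stripped.
$value = "caf\xC3\xA9\x00 ol\xC3\xA9\x07";

// Byte-oriented class from the question: also deletes both bytes of every "é".
$byteWise = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $value);

// Unicode-aware pattern: removes only the control characters.
$unicodeAware = preg_replace('/\p{Cc}+/u', '', $value);

var_dump($byteWise);     // string(6) "caf ol"
var_dump($unicodeAware); // string(10) "café olé"
```

Two things to keep in mind: \p{Cc} also covers tab, carriage return and newline, just as \x00-\x1F did, and with the /u modifier preg_replace returns NULL when the subject is not valid UTF-8, so it is worth checking the return value before storing it.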

166

answered Oct 05 '22 06:10

Alan Moore

You can use Unicode character properties to allow only the characters you want:

preg_replace('/[^\p{L}\s]/u','',$value);

(This keeps only letters and whitespace; add any other classes you want to let through, such as \p{N} for numbers or \p{P} for punctuation.)
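
As a rough sketch of what adding classes looks like, the following compares the letters-and-whitespace pattern with a wider allow-list. The sample string, the variable names and the particular choice of extra classes (\p{M}, \p{N}, \p{P}) are assumptions for illustration, not a recommendation for every application.

```php
<?php
// Hypothetical sample input: "Señor, 42 años!" followed by a NUL byte.
$value = "Se\xC3\xB1or, 42 a\xC3\xB1os!\x00";

// Letters and whitespace only: digits and punctuation are removed as well.
$lettersOnly = preg_replace('/[^\p{L}\s]/u', '', $value);

// Wider allow-list: letters, combining marks, numbers, punctuation, whitespace.
$extended = preg_replace('/[^\p{L}\p{M}\p{N}\p{P}\s]/u', '', $value);

var_dump($lettersOnly); // "Señor  años" (two spaces remain where ", 42 " was)
var_dump($extended);    // "Señor, 42 años!" (only the NUL byte is removed)
```

Symbols such as "$" or "+" belong to \p{S}, which this allow-list still removes; add it if you need them.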

If you want to convert Unicode to ASCII, this is by no means foolproof, but it gives some nice transliterations:

echo iconv('utf-8','ascii//translit','éñó'); // may print 'eno'; the result depends on the iconv implementation and locale

answered Oct 05 '22 07:10

Wrikken
