How to handle user input of invalid UTF-8 characters?

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.

Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.

W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back."

- How exactly should this be done in practice, throughout a site with dozens of different places where data can be input?
- How do you present the error in a helpful way to the user?
- How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
- For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database but do something (what?) before json_encode()?

I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP?". I'd like advice from people with real-world experience on how they've handled this.

As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

What is an invalid UTF-8 character?

An invalid UTF-8 character is a byte sequence that does not follow UTF-8's encoding rules. The error appears when an uploaded file is not encoded as UTF-8, the dominant character encoding on the World Wide Web. It typically occurs because the software that saved the file used a different encoding, such as ISO-8859-1, instead of UTF-8.

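How do you remove a non-UTF-8 character from a text file in Unix?

To automatically find and delete non-UTF-8 characters, use the iconv command, which converts text from one character encoding to another on Linux systems. For example, iconv -f UTF-8 -t UTF-8 -c input.txt > output.txt writes a copy with the invalid sequences dropped (-c tells iconv to omit characters it cannot convert; the file names are placeholders).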

How do I check if a UTF-8 file is valid?

```shell
iconv -f UTF-8 your_file > /dev/null; echo $?
```

The command prints 0 if the file could be converted successfully and 1 if not. On failure, iconv also reports the position where the invalid byte sequence occurred. The output encoding doesn't have to be specified; when it is omitted, iconv uses the encoding of the current locale.
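
Inside a PHP application the same check can be done without shelling out. This is a minimal sketch, not a definitive implementation: is_valid_utf8() is a hypothetical helper name, mbstring is assumed to be available in most setups, and the fallback relies on preg_match() rejecting subjects that are not valid UTF-8 when the u modifier is set.

```php
<?php
// Hypothetical helper: returns true when $bytes is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($bytes, 'UTF-8');
    }

    // Fallback without mbstring: with the u modifier, preg_match() returns
    // false for a subject that is not valid UTF-8.
    return preg_match('//u', $bytes) === 1;
}

// Example use; 'your_file' is the same placeholder path as in the command above.
// var_dump(is_valid_utf8(file_get_contents('your_file')));
```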

The accept-charset="UTF-8" attribute is only a hint for browsers, and nothing forces a client to submit data in that encoding. Crappy form submission bots are a good example...

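I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions (these only convert between ISO-8859-1 and UTF-8, and are deprecated as of PHP 8.2). If you use iconv, you also have the option to transliterate, although //TRANSLIT applies to characters that cannot be represented in the target charset rather than to invalid byte sequences.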

Here is an example using iconv():

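```php
// Drop any byte sequences that are not valid UTF-8.
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);

// Transliterate characters that cannot be represented in the target charset.
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
```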
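
Since the question asks for U+FFFD rather than silent removal, here is a minimal sketch that uses mbstring instead of iconv. It assumes the mbstring extension is loaded, and utf8_replace_invalid() is a hypothetical name. How many replacement characters a truncated multi-byte sequence produces can differ between PHP versions, so treat the exact count as unspecified.

```php
<?php
// Hypothetical helper: replace every invalid UTF-8 byte sequence with U+FFFD.
function utf8_replace_invalid(string $bytes): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character(0xFFFD);        // substitute with U+FFFD

    // Converting UTF-8 to UTF-8 forces mbstring to validate the input and
    // substitute anything that is not well-formed.
    $clean = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);     // restore the previous setting
    return $clean;
}

// Example: the stray 0xFF byte becomes U+FFFD, the rest is untouched.
echo utf8_replace_invalid("abc\xFFdef"), "\n";
```

On PHP 7.2 and later, mb_scrub() does the same job in one call and honours the same substitute-character setting.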

If you want to display an error message to your users, I'd probably do this globally rather than for each value received. Something like this would probably do just fine:

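```php
// Remove invalid UTF-8 byte sequences from a single string.
function utf8_clean($str) {
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

// Assumes $_GET holds only string values; nested arrays need a recursive walk.
$clean_GET = array_map('utf8_clean', $_GET);

// If cleaning changed anything, some input was not valid UTF-8.
if (serialize($_GET) != serialize($clean_GET)) {
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!
```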
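
Request data can contain nested arrays (for example a field named tags[]), which array_map() with a string function does not handle. Below is a sketch of a recursive variant, assuming PHP 7.2+ with mbstring. utf8_clean_array() and its $hadInvalid out-parameter are hypothetical names, and array keys are left unchecked.

```php
<?php
// Hypothetical helper: repairs a (possibly nested) input array.
// Returns the cleaned copy; $hadInvalid is set to true if anything was changed.
function utf8_clean_array(array $input, bool &$hadInvalid = false): array
{
    $hadInvalid = false;

    array_walk_recursive($input, function (&$value) use (&$hadInvalid) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $hadInvalid = true;
            // mb_scrub() replaces invalid sequences with the current mbstring
            // substitute character ('?' unless configured otherwise).
            $value = mb_scrub($value, 'UTF-8');
        }
    });

    return $input;
}

// Possible use at the top of a request (commented out so the sketch runs anywhere):
// $_GET = utf8_clean_array($_GET, $bad);
// if ($bad) { $error_msg = 'Your data is not valid UTF-8 and has been repaired.'; }
```

Checking with mb_check_encoding() first means valid input is never rewritten, which also makes the serialize() comparison unnecessary.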

You may also want to normalize newlines and strip control characters (visible or not), like this:

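```php
function Clean($string, $control = true) {
    // Drop invalid UTF-8 byte sequences first.
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
        // Strip every character in the Unicode "Other" category (\p{C}),
        // which includes tabs and newlines.
        return preg_replace('~\p{C}+~u', '', $string);
    }

    // Otherwise normalize \r\n and \r to \n, then strip \p{C} except tab and newline.
    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}
```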

Code to convert from UTF-8 to Unicode code points:

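```php
// Return the code point of a single UTF-8 character as "U+XXXX", or null on failure.
function Codepoint($char) {
    $result = null;
    // UCS-4BE encodes each code point as a 32-bit big-endian integer.
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
```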

It is probably faster than any other alternative, though I haven't tested it extensively.
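
On PHP 7.2 or later, mb_ord() returns the code point directly, which avoids the iconv()/unpack() round trip. The sketch below assumes that version and the mbstring extension. codepoint_label() is a hypothetical name, and no speed comparison with the function above is implied.

```php
<?php
// Hypothetical helper: "U+XXXX" label for the first character of $char,
// or null if $char is empty or not valid UTF-8.
function codepoint_label(string $char): ?string
{
    if ($char === '' || !mb_check_encoding($char, 'UTF-8')) {
        return null;
    }

    $codepoint = mb_ord($char, 'UTF-8');

    return $codepoint === false ? null : sprintf('U+%04X', $codepoint);
}

echo codepoint_label("\u{E0}"), "\n";   // U+00E0 (à)
echo codepoint_label("\u{3072}"), "\n"; // U+3072 (ひ)
```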

Example:

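```php
// U+FFFE at the start and U+FFFD at the end, written as UTF-8 byte escapes.
$string = "\xEF\xBF\xBEhello world\xEF\xBF\xBD";

// U+FFFEhello worldU+FFFD
// Note: \p{So} also matches ordinary symbols such as ©, so adjust the class as needed.
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

// Callback: receives the match array and returns each match as "U+XXXX".
function Bad_Codepoint($string) {
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}
```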

This may be what you were looking for.
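
For the json_encode() failures mentioned in the question, PHP 7.2 added flags that deal with invalid UTF-8 at encoding time, so stored data can be left as-is. This is a sketch under that version assumption, and json_encode_lenient() is a hypothetical wrapper name.

```php
<?php
// Hypothetical wrapper: encode $value, turning invalid UTF-8 into U+FFFD
// instead of failing. Requires PHP 7.2+ for JSON_INVALID_UTF8_SUBSTITUTE.
function json_encode_lenient($value): string
{
    $json = json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE);

    if ($json === false) {
        // Some other problem, e.g. nesting depth or an unsupported type.
        throw new RuntimeException(json_last_error_msg());
    }

    return $json;
}

$row = array('name' => "caf\xE9"); // stray ISO-8859-1 byte, not valid UTF-8

var_dump(json_encode($row));          // bool(false): plain json_encode() gives up
echo json_encode_lenient($row), "\n"; // encodes, with the bad byte substituted
```

JSON_INVALID_UTF8_IGNORE is the sibling flag that drops the bad bytes instead of substituting them.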

Receiving invalid characters from your web application might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:

<form action="..." accept-charset="UTF-8">

You might also want to look at similar questions on Stack Overflow for pointers on how to handle invalid characters. That said, I think signaling an error to the user is better than trying to clean up invalid characters, since cleanup can cause unexpected loss of significant data or unexpected changes to your user's input.
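
One way to signal the error without losing the user's text is to validate each field, keep the raw value out of storage, and redisplay it with the bad bytes made visible. Below is a sketch, assuming PHP 5.4+ for ENT_SUBSTITUTE and the mbstring extension. check_field() is a hypothetical helper, not an established API.

```php
<?php
// Hypothetical helper: validates one submitted field.
// Returns array(bool $ok, string $htmlSafeValue).
function check_field(string $raw): array
{
    $ok = mb_check_encoding($raw, 'UTF-8');

    // ENT_SUBSTITUTE makes htmlspecialchars() emit U+FFFD for invalid
    // sequences instead of returning an empty string, so the form can be
    // re-rendered with everything the user typed.
    $display = htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    return array($ok, $display);
}

list($ok, $value) = check_field("a<b\xFF");
if (!$ok) {
    echo "Some characters could not be read and are shown as \u{FFFD}. Please retype them.\n";
}
echo '<input name="comment" value="', $value, '">', "\n";
```

Because nothing is stripped, the user sees exactly where the problem is and can correct it before resubmitting.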

Tags: php, encoding, utf-8

philfreo

2 Answers

Alix Axel

Arc
