Ensuring valid UTF-8 in PHP

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO 8859-1, or perhaps Windows-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this?

For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {
    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");
    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    }
    else {
        return $string;
    }
}

asked Oct 06 '09 03:10

Brian

2 Answers

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are those in the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro sign in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
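To make the 0x80-0x9F difference concrete, here is a small sketch that decodes the same raw byte under both encodings. It assumes the iconv extension is loaded and recognises the name CP1252; the helper name decode_byte_as is invented for illustration.

```php
<?php
// Hypothetical helper: decode raw bytes under a given single-byte
// encoding and return the resulting UTF-8 bytes as an uppercase hex string.
function decode_byte_as(string $bytes, string $encoding): string
{
    $utf8 = iconv($encoding, 'UTF-8', $bytes);
    return strtoupper(bin2hex((string) $utf8));
}

// 0x93 is a left curly quote (U+201C) in Windows-1252 ...
echo decode_byte_as("\x93", 'CP1252'), "\n";     // E2809C
// ... but an invisible C1 control character (U+0093) in ISO-8859-1.
echo decode_byte_as("\x93", 'ISO-8859-1'), "\n"; // C293

// Outside 0x80-0x9F the two encodings agree: 0xE9 is U+00E9 in both.
echo decode_byte_as("\xE9", 'CP1252'), "\n";     // C3A9
echo decode_byte_as("\xE9", 'ISO-8859-1'), "\n"; // C3A9
```

Both conversions succeed, which is why the choice between the two encodings has to be a policy decision rather than something detected from the bytes.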

would this code ensure that a string is safe to insert into a UTF-8 encoded document

You would certainly want to set the optional 'strict' parameter of mb_detect_encoding to TRUE for this purpose. Even so, I'm not sure that covers all invalid UTF-8 sequences: the function does not claim to check a byte sequence for UTF-8 validity explicitly. mb_detect_encoding has been known to guess UTF-8 incorrectly in the past, though I don't know whether that can still happen in strict mode.

If you want to be sure, do it yourself using the W3C-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);
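A shorter validity check that is sometimes used instead of the hand-written pattern relies on PCRE itself: with the u modifier, preg_match() validates the subject as UTF-8 and returns false when it is malformed. This is a different technique from the regex above, not part of the original answer, and the two are not equivalent: the W3C pattern also rejects most ASCII control characters, while the u modifier accepts them. The function names below are invented for this sketch, and the fallback assumes an iconv that knows CP1252.

```php
<?php
// Hypothetical helper: true when $string is well-formed UTF-8.
function is_valid_utf8(string $string): bool
{
    // With the "u" modifier PCRE checks the subject's UTF-8 validity first;
    // preg_match() returns false (not 0) when that check fails.
    return preg_match('//u', $string) === 1;
}

// Hypothetical wrapper: keep valid UTF-8 as-is, otherwise treat the
// bytes as Windows-1252, as the snippet above does.
function utf8_or_from_cp1252(string $string): string
{
    if (is_valid_utf8($string)) {
        return $string;
    }
    $converted = iconv('CP1252', 'UTF-8', $string);
    if ($converted === false) {
        // Some iconv builds reject bytes that Windows-1252 leaves undefined.
        throw new RuntimeException('Input could not be converted from CP1252');
    }
    return $converted;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true)
var_dump(is_valid_utf8("caf\xE9"));     // bool(false)
echo bin2hex(utf8_or_from_cp1252("caf\xE9")), "\n"; // 636166c3a9
```

Throwing on a failed conversion is a design choice for the sketch; returning the original bytes or stripping them would be equally defensible, depending on how much data loss is acceptable.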

answered Oct 07 '22 14:10

bobince

With the mbstring library, you have mb_check_encoding().

Example of use:

mb_check_encoding($string, 'UTF-8');
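As a rough illustration of what that call returns, the sketch below runs it over a few hand-built byte strings. The sample inputs are made up for this example and it assumes the mbstring extension is loaded.

```php
<?php
// Each sample is a raw byte string; only the first is well-formed UTF-8.
$samples = [
    'valid UTF-8'               => "caf\xC3\xA9",
    'lone ISO-8859-1 byte 0xE9' => "caf\xE9",
    'truncated 3-byte sequence' => "\xE2\x82",
];

foreach ($samples as $label => $bytes) {
    // mb_check_encoding() returns a boolean and never modifies the input.
    $ok = mb_check_encoding($bytes, 'UTF-8');
    echo $label, ': ', $ok ? 'valid' : 'invalid', "\n";
}
```

Because the function only reports validity, a conversion step (for example from Windows-1252) is still needed for the inputs it rejects.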

With PHP 7.1.9 on a recent Windows 10 system, the regex solution outperforms mb_check_encoding() for any string length (20,000 iterations each):

10 characters: regex => 4 ms, mb_check_encoding() => 64 ms
10,000 characters: regex => 125 ms, mb_check_encoding() => 2.4 s
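Timings like these depend heavily on the PHP version, the platform, and whether PCRE's JIT is enabled, so they are worth re-measuring locally. The harness below is a minimal sketch of how such a comparison might be run; it is not the code that produced the figures above, and the function names and the all-ASCII test strings are assumptions.

```php
<?php
// W3C-style UTF-8 pattern from the first answer.
const UTF8_PATTERN = '%^(?:
      [\x09\x0A\x0D\x20-\x7E]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
)*$%xs';

// Hypothetical wrappers so both checks share one call signature.
function regex_check(string $s): bool
{
    return preg_match(UTF8_PATTERN, $s) === 1;
}

function mb_check(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

// Run $check on $subject $iterations times; return elapsed milliseconds.
function time_check(callable $check, string $subject, int $iterations = 20000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $check($subject);
    }
    return (microtime(true) - $start) * 1000.0;
}

foreach ([10, 10000] as $length) {
    $subject = str_repeat('a', $length); // assumed input: plain ASCII
    printf(
        "%5d characters: regex => %.1f ms, mb_check_encoding() => %.1f ms\n",
        $length,
        time_check('regex_check', $subject),
        time_check('mb_check', $subject)
    );
}
```

Inputs containing multi-byte or invalid sequences may shift the balance between the two checks, so it is worth timing data that resembles the real workload.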

answered Oct 07 '22 16:10

Frosty Z
