PHP cp1252/windows-1252 conversion to UTF-8

Tags:

php

mysql

character-encoding

encoding

utf-8

I'm in the process of trying to convert our database from latin1 to UTF-8. Unfortunately I can't do a massive single switchover as the application needs to stay online and we have 700GB of database to convert.

So I'm trying to leverage a little MySQL hack: converting the tables to UTF-8 but not the data. I'd like the data to be read, converted, and replaced in real time (a JIT conversion, if you will).

Our PHP app currently uses all of the defaults, so it connects to MySQL using the latin1 character set and stores UTF-8 data as though it were latin1. When you view the data with latin1, the UTF-8 characters show up as expected. When you view the data with UTF-8, things get jumbled up.

So I propose forcing the MySQL character set to UTF-8 and then doing a just-in-time conversion of the data if necessary. Now, seeing as cp1252/windows-1252 is a subset of UTF-8, it's not so straightforward (as far as I can see) to detect the cp1252/windows-1252 encoding.

I've written the following code that attempts to detect cp1252/windows-1252 encoding and convert as necessary. It should also detect properly encoded UTF-8 characters and do nothing.

$a = 'Cardâ˜ƒ'; //cp1252 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
    // not found, original string was not cp1252 encoded, so print
    var_dump($a);
} else {
    // found, original string was cp1252 encoded, remove test character and print
    // This case runs
    $c = mb_strcut($c, 1);
    var_dump($c);
}

$a = 'COD☃'; //proper UTF8 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
    // not found, original string was not cp1252 encoded, so print
    // This case runs
    var_dump($a);
} else {
    // found, original string was cp1252 encoded, remove test character and print
    $c = mb_strcut($c, 1);
    var_dump($c);
}

The output of running this code is:

string 'Card☃' (length=7)
string 'COD☃' (length=6)

I understand that running this on all strings coming out of the database will have a performance impact, yet to be measured, but if I can do a JIT conversion before switching everything completely it's worth it to me.

Does anyone have any pointers on how to optimize this?

asked Mar 27 '14 09:03

rnavarro

1 Answer

Firstly, Windows-1252 is not a subset of UTF-8. The most you can say is that ASCII is a subset of UTF-8, since bytes 0x00 through 0x7F are encoded identically in both, but that is usually more of an ideological debate.

Secondly, it is impossible to handle strings with both CP1252 and UTF-8 "characters" in them (really, for CP1252 it's a byte and for Unicode it's a code point). Either you read the string as CP1252 and see every byte of a multi-byte UTF-8 sequence as a separate character, or you read it as UTF-8 and it cuts out any invalid byte sequences (or creates random characters if the CP1252 bytes happen to form a valid UTF-8 sequence). You are not removing the test character with $c = mb_strcut($c, 1); you are removing a question mark that mb_convert_encoding created because it could not convert that Unicode character into a CP1252 character.
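To make the byte-versus-code-point point concrete, here is a minimal sketch (assuming the mbstring extension is loaded; the variable names are only illustrative) that shows what happens to the snowman's three UTF-8 bytes in each direction:

```php
<?php
// Make mbstring's default substitution character ('?') explicit.
mb_substitute_character(0x3F);

// U+2603 SNOWMAN encoded as UTF-8 is three bytes.
$snowman = "\xE2\x98\x83";

// Misreading those bytes as Windows-1252 yields three separate characters.
$misread = mb_convert_encoding($snowman, 'UTF-8', 'Windows-1252');
echo $misread, ' (', mb_strlen($misread, 'UTF-8'), " characters)\n";

// Converting that mojibake back to Windows-1252 recovers the original bytes.
$roundtrip = mb_convert_encoding($misread, 'Windows-1252', 'UTF-8');
echo bin2hex($roundtrip), "\n";

// A genuine snowman has no Windows-1252 equivalent, so it is substituted.
$lossy = mb_convert_encoding('COD' . $snowman, 'Windows-1252', 'UTF-8');
echo $lossy, "\n";
```

The last line is the same effect that turns the second test string into ?COD?: the conversion itself destroys the character that the later check is looking for.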

Thirdly, you should never convert a string and then, after the fact, try to determine its encoding. After you converted your second test string, it was ?COD?. There is no reason to check whether a Unicode character exists in it, because you converted it to CP1252. There can't be Unicode characters in it. As the programmer, you have to know what the output is.

The only solution is to check if the string is CP1252, convert the offending characters to placeholders, and then convert that string to Unicode:

function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
    if ($input === null || $input === '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Use the search/replace arrays if a character needs to be replaced with
         * something other than its Unicode equivalent.
         */ 

        /*$replace = array(
            128 => "&#x20AC;",      // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
            129 => "",              // UNDEFINED
            130 => "&#x201A;",      // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
            131 => "&#x0192;",      // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
            132 => "&#x201E;",      // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
            133 => "&#x2026;",      // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
            134 => "&#x2020;",      // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
            135 => "&#x2021;",      // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
            136 => "&#x02C6;",      // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
            137 => "&#x2030;",      // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
            138 => "&#x0160;",      // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
            139 => "&#x2039;",      // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
            140 => "&#x0152;",      // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
            141 => "",              // UNDEFINED
            142 => "&#x017D;",      // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON 
            143 => "",              // UNDEFINED
            144 => "",              // UNDEFINED
            145 => "&#x2018;",      // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK 
            146 => "&#x2019;",      // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
            147 => "&#x201C;",      // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
            148 => "&#x201D;",      // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
            149 => "&#x2022;",      // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
            150 => "&#x2013;",      // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
            151 => "&#x2014;",      // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
            152 => "&#x02DC;",      // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
            153 => "&#x2122;",      // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
            154 => "&#x0161;",      // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
            155 => "&#x203A;",      // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
            156 => "&#x0153;",      // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
            157 => "",              // UNDEFINED
            158 => "&#x017e;",      // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
            159 => "&#x0178;",      // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
        );*/

        if (count($replace) != 0) {
            $find = array();
            foreach (array_keys($replace) as $key) {
                $find[] = chr($key);
            }
            $input = str_replace($find, array_values($replace), $input);
        }
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
        if (count($replace) != 0) {
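            // Name the target charset explicitly; the default differs between PHP versions.
            $input = html_entity_decode($input, ENT_QUOTES, 'UTF-8');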
        }
    }
    return $input;
}

The trick is that you have to check for both ISO-8859-1 and CP1252 because they are so similar. I found this out the hard way after hours of playing around with this function, only to have this answer save me. If you found this function helpful, go +1 that answer.
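How similar the two encodings are can be seen with a short sketch (assuming mbstring is available; `utf8_hex` is a hypothetical helper, not part of the function above). It converts a single byte from each encoding and prints the resulting UTF-8 bytes in hex:

```php
<?php
// Hypothetical helper: convert one byte from $from to UTF-8 and return the hex.
function utf8_hex(string $byte, string $from): string
{
    return bin2hex(mb_convert_encoding($byte, 'UTF-8', $from));
}

// 0x80 lies in the 0x80-0x9F range where the two encodings differ;
// 0xE9 (e with acute accent) lies outside it.
foreach (["\x80", "\xE9"] as $byte) {
    printf(
        "0x%s: ISO-8859-1 -> %s, Windows-1252 -> %s\n",
        strtoupper(bin2hex($byte)),
        utf8_hex($byte, 'ISO-8859-1'),
        utf8_hex($byte, 'Windows-1252')
    );
}
```

Byte 0x80 becomes the control character U+0080 (c280) when read as ISO-8859-1 but the euro sign U+20AC (e282ac) when read as Windows-1252, while 0xE9 becomes c3a9 either way.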

Basically, when a $replace map is passed, this function first replaces those problematic CP1252 bytes with HTML entities representing the Unicode characters. We then convert the string from ISO-8859-1/CP1252 to UTF-8; the new entities are not mangled because they consist only of ASCII characters. Finally, we decode the HTML entities and have a 100% Unicode string.
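Since almost any byte sequence is valid in a single-byte encoding, a just-in-time conversion may also want to skip strings that are already valid UTF-8 before converting anything. The following is only a sketch of that guard, not the function above: `ensure_utf8` is a hypothetical name, mbstring is assumed, and text that merely happens to form valid UTF-8 sequences would be left untouched.

```php
<?php
/**
 * Hypothetical helper: return $input unchanged when it is already valid
 * UTF-8, otherwise treat it as Windows-1252 and convert it to UTF-8.
 */
function ensure_utf8(string $input): string
{
    if ($input === '' || mb_check_encoding($input, 'UTF-8')) {
        return $input; // nothing to do
    }
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

// A lone 0xE9 byte is not valid UTF-8, so it is converted.
echo bin2hex(ensure_utf8("caf\xE9")), "\n";
// The same word already in UTF-8 passes through unchanged.
echo bin2hex(ensure_utf8("caf\xC3\xA9")), "\n";
// CP1252 curly quotes (0x93, 0x94) become U+201C and U+201D.
echo bin2hex(ensure_utf8("\x93quoted\x94")), "\n";
```

Checking validity first keeps properly encoded rows from being double-encoded, which matters when converted and unconverted rows live side by side during a gradual migration.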

answered Sep 30 '22 23:09

NobleUplift
