PHP: mb_strtoupper not working

I have a problem with UTF-8 and mb_strtoupper.

mb_internal_encoding('UTF-8');
$guesstitlestring='Le Courrier de Sáint-Hyácinthe';

$encoding=mb_detect_encoding($guesstitlestring);
if ($encoding!=='UTF-8') $guesstitlestring=mb_convert_encoding($guesstitlestring,'UTF-8',$encoding);

echo "DEBUG1 $guesstitlestring\n";
$guesstitlestring=mb_strtoupper($guesstitlestring);
echo "DEBUG2 $guesstitlestring\n";

Result:

DEBUG1 Le Courrier de Sáint-Hyácinthe
DEBUG2 LE COURRIER DE S?INT-HY?CINTHE

I don't understand why this is happening. I'm trying to be as careful as I can with the encoding: the string starts out as UTF-8, is verified, and is reconverted to UTF-8 if necessary. It's a nightmare!

UPDATE

So I've figured out that this was caused by the console: I was passing the arguments in through the console and reading the output back from the console, so the text was garbled both on the way in and on the way out. The solution is to avoid passing the arguments in, or reading the results out, this way.

Thank you everyone for your help in resolving this issue!

Alasdair asked Feb 24 '13 11:02


2 Answers

Instead of strtoupper()/mb_strtoupper(), use mb_convert_case(), since upper-case conversion is very tricky across different encodings. Also make sure your string really IS UTF-8.

$content = 'Le Courrier de Sáint-Hyácinthe';

mb_internal_encoding('UTF-8');
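// Re-encode unless the string is valid UTF-8 and survives a UTF-8 -> UTF-32 -> UTF-8 round trip.
if (!mb_check_encoding($content, 'UTF-8')
    || $content !== mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {

    // No source encoding is passed here, so the internal encoding set above is used as the source.
    $content = mb_convert_encoding($content, 'UTF-8');
}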

// LE COURRIER DE SÁINT-HYÁCINTHE
echo mb_convert_case($content, MB_CASE_UPPER, "UTF-8"); 

Working example: http://3v4l.org/enEfm#v443

See also my comment at the PHP website about the converter: http://www.php.net/manual/function.utf8-encode.php#102382

powtac answered Sep 21 '22 15:09


It works for me, but only when the PHP file itself is saved as UTF-8 and the terminal I'm in expects UTF-8. I think what is happening for you is that the file is saved as ISO-8859-1 and your terminal is expecting ISO-8859-1.

First, mb_detect_encoding doesn't actually work for this string. Even when the PHP file is not UTF-8, it still reports it as UTF-8.
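
Since detection can mislead here, one option is to validate the bytes against a specific encoding instead. The following is a minimal sketch, assuming the mbstring extension is loaded; the variable names and the candidate list passed to mb_detect_encoding are illustrative choices, not something taken from the question.

```php
<?php
// Byte escapes keep this example independent of how the file itself is saved.
$latin1 = "S\xE1int";     // "Sáint" in ISO-8859-1: á is the single byte 0xE1
$utf8   = "S\xC3\xA1int"; // "Sáint" in UTF-8: á is the two bytes 0xC3 0xA1

// mb_check_encoding() answers a yes/no question about one named encoding.
$latin1IsUtf8 = mb_check_encoding($latin1, 'UTF-8'); // 0xE1 is not followed by continuation bytes
$utf8IsUtf8   = mb_check_encoding($utf8, 'UTF-8');

// Strict detection over an explicit candidate list (the list is an assumption).
$detected = mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], true);

var_dump($latin1IsUtf8, $utf8IsUtf8, $detected);
```

Validation only says whether the bytes are well-formed UTF-8. It cannot say which legacy encoding a non-UTF-8 string was written in, so that still has to be known from context.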

When you print the lower case string, it prints ISO-8859-1 characters and your terminal displays them just fine. Then when you convert to upper case using UTF-8, it gets mangled.
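
To make the mismatch concrete, here is a small sketch of converting the bytes with an explicit source encoding before upper-casing them. It assumes the input really is ISO-8859-1, as a file saved in that encoding would be; if the source encoding is something else, the third argument has to change accordingly.

```php
<?php
// The string as it would appear in a file saved as ISO-8859-1.
$latin1 = "S\xE1int-Hy\xE1cinthe";

// Name the source encoding explicitly; ISO-8859-1 is an assumption about the input.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Upper-case only after the bytes are known to be UTF-8.
$upper = mb_strtoupper($utf8, 'UTF-8');

// Show how the single ISO-8859-1 byte for á becomes two bytes in UTF-8.
$aInUtf8 = bin2hex(mb_convert_encoding("\xE1", 'UTF-8', 'ISO-8859-1'));
echo 'e1 -> ', $aInUtf8, "\n"; // prints "e1 -> c3a1"
echo $upper, "\n";
```

On a terminal that expects UTF-8, the last line should display with the accented capitals intact.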

I created two versions of this file. I saved it using my text editor in ISO-8859-1 as iso-8859-1.php. Then I used iconv to convert the entire file to UTF-8 and saved it as utf-8.php

iconv -f ISO-8859-1 -t UTF-8 iso-8859-1.php > utf-8.php

I added a line to print the encoding that mb_detect_encoding returns.

$ file iso-8859-1.php 
iso-8859-1.php: PHP script, ISO-8859 text

$ php iso-8859-1.php 
ENCODING: UTF-8
DEBUG1 Le Courrier de S�int-Hy�cinthe
DEBUG2 LE COURRIER DE S?INT-HY?CINTHE

$ file utf-8.php 
utf-8.php: PHP script, UTF-8 Unicode text

$ php utf-8.php 
ENCODING: UTF-8
DEBUG1 Le Courrier de Sáint-Hyácinthe
DEBUG2 LE COURRIER DE SÁINT-HYÁCINTHE

My terminal actually expects UTF-8 text, so when I print out ISO-8859-1 text it gets mangled. Everything works correctly when the file is saved as utf-8 and the terminal expects utf-8.

Stephen Ostermiller answered Sep 20 '22 15:09