
How to convert 'u00e9' into a utf8 char, in mysql or php?

I'm doing some data cleansing on messy data that is being imported into MySQL.

The data contains 'pseudo' Unicode characters, which are embedded in the strings as literal text such as 'u00e9'.

So one field might be 'Jalostotitlu00e1n'. I need to rip out that clumsy 'u00e1' and replace it with the corresponding UTF-8 character.

I could do this in MySQL, maybe using SUBSTRING and CHAR, but I'm preprocessing the data via PHP, so I could do it there as well.

I already know how to configure MySQL and PHP to work with UTF-8 data. The problem is really just in the source data I'm importing.

Thanks

carpii asked Aug 15 '11 03:08



3 Answers

/* PHP function that replaces literal \uXXXX escape sequences with the
   matching accented Latin-1 characters. The keys include the leading
   backslash; for data like 'u00e9' the backslash would have to be dropped
   from each key. */

function Utf8_ansi($valor = '') {
    // Single-quoted keys keep the backslash and the hex digits as literal text.
    $utf8_ansi2 = array(
        '\u00c0' => "À", '\u00c1' => "Á", '\u00c2' => "Â", '\u00c3' => "Ã",
        '\u00c4' => "Ä", '\u00c5' => "Å", '\u00c6' => "Æ", '\u00c7' => "Ç",
        '\u00c8' => "È", '\u00c9' => "É", '\u00ca' => "Ê", '\u00cb' => "Ë",
        '\u00cc' => "Ì", '\u00cd' => "Í", '\u00ce' => "Î", '\u00cf' => "Ï",
        '\u00d1' => "Ñ", '\u00d2' => "Ò", '\u00d3' => "Ó", '\u00d4' => "Ô",
        '\u00d5' => "Õ", '\u00d6' => "Ö", '\u00d8' => "Ø", '\u00d9' => "Ù",
        '\u00da' => "Ú", '\u00db' => "Û", '\u00dc' => "Ü", '\u00dd' => "Ý",
        '\u00df' => "ß", '\u00e0' => "à", '\u00e1' => "á", '\u00e2' => "â",
        '\u00e3' => "ã", '\u00e4' => "ä", '\u00e5' => "å", '\u00e6' => "æ",
        '\u00e7' => "ç", '\u00e8' => "è", '\u00e9' => "é", '\u00ea' => "ê",
        '\u00eb' => "ë", '\u00ec' => "ì", '\u00ed' => "í", '\u00ee' => "î",
        '\u00ef' => "ï", '\u00f0' => "ð", '\u00f1' => "ñ", '\u00f2' => "ò",
        '\u00f3' => "ó", '\u00f4' => "ô", '\u00f5' => "õ", '\u00f6' => "ö",
        '\u00f8' => "ø", '\u00f9' => "ù", '\u00fa' => "ú", '\u00fb' => "û",
        '\u00fc' => "ü", '\u00fd' => "ý", '\u00ff' => "ÿ"
    );

    // strtr() replaces every key found in the string with its mapped character.
    return strtr($valor, $utf8_ansi2);
}
Sergio-MA-Brazil answered Sep 20 '22 12:09


There's a way. Replace all uXXXX with their HTML representation and do an html_entity_decode()

I.e. echo html_entity_decode("Jalostotitl&#x00e1;n");

Every UTF character in the form u1234 can be written in HTML as &#x1234;. But doing the replacement reliably is quite hard, because there can be many false positives if no other character marks the beginning of a UTF sequence. A simple regex could be

preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $str)
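The two steps can be combined into one helper. This is only a minimal sketch, assuming PHP 7 or later and UTF-8 as the target encoding; the function name `decode_u_escapes` is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: turns bare uXXXX sequences into UTF-8 characters
// by way of numeric HTML entities.
function decode_u_escapes(string $str): string
{
    // Step 1: rewrite every uXXXX as the numeric entity &#xXXXX;
    $entities = preg_replace('/u([\da-fA-F]{4})/', '&#x$1;', $str);

    // Step 2: decode the entities into UTF-8 characters.
    // Note: this also decodes any other entities already in the string.
    return html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decode_u_escapes('Jalostotitlu00e1n'), "\n"; // Jalostotitlán
```

The false-positive risk mentioned above is real: a value such as 'ubuntu2004' contains 'u2004', so this sketch would turn it into 'ubunt' followed by the character U+2004. It is worth checking the data for such cases before running a blanket replace.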

rabudde answered Sep 16 '22 12:09


My Twitter timeline script returns special characters like é as \u00e9, so I stripped the backslash and used @rabudde's preg_replace.

// Convert \uXXXX escape sequences to HTML entities
$text = "De #Haarstichting is h\u00e9t medium voor alles";
$str = str_replace('\u', 'u', $text);
$str_replaced = preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $str);

echo $str_replaced;

It works for me. When the output is rendered as HTML, it turns: De #Haarstichting is h\u00e9t medium voor alles into: De #Haarstichting is hét medium voor alles
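When the backslash is still present, as in the Twitter data above, it can serve as the anchor for the match instead of being stripped, which avoids most false positives. The sketch below is a swapped-in alternative, not what this answer does: it assumes PHP 7 or later with the json extension, and the function name `decode_backslash_u` is hypothetical. It relies on the fact that \uXXXX is the JSON string escape, so `json_decode()` can decode each run of escapes directly to UTF-8.

```php
<?php
// Hypothetical helper: decodes \uXXXX escapes to UTF-8 while keeping the
// backslash as the anchor for the match.
function decode_backslash_u(string $text): string
{
    $result = preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            // A run of \uXXXX escapes is a valid JSON string body, so
            // json_decode() can convert it, surrogate pairs included.
            $decoded = json_decode('"' . $m[0] . '"');
            // Leave the run untouched if it cannot be decoded.
            return is_string($decoded) ? $decoded : $m[0];
        },
        $text
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

$text = 'De #Haarstichting is h\u00e9t medium voor alles';
echo decode_backslash_u($text), "\n"; // De #Haarstichting is hét medium voor alles
```

Because the result is real UTF-8 rather than an HTML entity, it can be stored in MySQL as-is, and text without a backslash (such as 'ubuntu2004') is left alone.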

Theo answered Sep 18 '22 12:09