Detect encoding and make everything UTF-8

Tags: php, character-encoding, encoding, utf-8

I'm reading lots of text from various RSS feeds and inserting it into my database.

Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

1. The "ß" in "Fußball" should look like this in my database: "ÂŸ". If it is a "ÂŸ", it is displayed correctly.
2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÂŸ". Then it is displayed wrongly, of course.
3. In other cases, the "ß" is saved as a "ß", so without any change. Then it is also displayed wrongly.

What can I do to avoid the cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

How do I find out what encoding the text uses?
How do I convert it to UTF-8 - whatever the old encoding is?

Would a function like this work?

    function correct_encoding($text) {
        $current_encoding = mb_detect_encoding($text, 'auto');
        $text = iconv($current_encoding, 'UTF-8', $text);
        return $text;
    }

I've tested it, but it doesn't work. What's wrong with it?


asked May 26 '09 13:05

caw

2 Answers

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
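To make the double-encoding effect concrete, here is a minimal sketch. It uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode() (utf8_encode() itself is deprecated as of PHP 8.2). The helper name latin1_to_utf8() is made up for this illustration.

```php
<?php
// Hypothetical helper: treat every byte of $text as one ISO 8859-1 character
// and re-encode it as UTF-8. This is the conversion utf8_encode() performs.
function latin1_to_utf8(string $text): string
{
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Fu\xDFball";       // "Fußball" as ISO 8859-1 bytes
$utf8   = "Fu\xC3\x9Fball";   // "Fußball" as UTF-8 bytes

$correct = latin1_to_utf8($latin1); // Latin-1 input: proper UTF-8 comes out
$garbled = latin1_to_utf8($utf8);   // UTF-8 input: each byte is encoded again

echo bin2hex($correct), "\n"; // 4675c39f62616c6c
echo bin2hex($garbled), "\n"; // 4675c383c29f62616c6c
```

The second result is valid UTF-8, but it now spells "Fu", "Ã", a control character and "ball" instead of "Fußball", which is the kind of garbling described in the question.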

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
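For comparison, a much simpler guard (not what the library does, just a sketch) is to convert only when the string is not already valid UTF-8. It assumes each string is entirely UTF-8 or entirely in one known single-byte encoding, so it does not handle the mixed case described above. The function name and the Windows-1252 fallback are assumptions made for this example.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise assume the whole
// string is in $fallback (an assumption, not something that can be detected
// reliably) and convert it.
function to_utf8_if_needed(string $text, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already UTF-8 (or plain ASCII): do nothing
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

echo bin2hex(to_utf8_if_needed("Fu\xDFball")), "\n";     // Latin-1 input gets converted
echo bin2hex(to_utf8_if_needed("Fu\xC3\x9Fball")), "\n"; // UTF-8 input is returned unchanged
```

The check relies on the fact that Latin-1 text containing non-ASCII characters is rarely also valid UTF-8, so it is a heuristic, not a guarantee.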

Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

    $latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

    echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
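To illustrate the general idea behind this kind of repair (this is not the library's actual implementation, only a sketch under stated assumptions): a double-encoded string can often be fixed by converting it from UTF-8 back to Windows-1252 and checking whether the resulting bytes are themselves valid UTF-8. The function name, the Windows-1252 assumption and the pass limit are all invented for this example.

```php
<?php
// Hypothetical helper: undo up to $maxPasses layers of "UTF-8 read as
// Windows-1252 and encoded again". Assumes Windows-1252 was the wrong guess.
function undo_double_encoding(string $text, int $maxPasses = 3): string
{
    for ($i = 0; $i < $maxPasses; $i++) {
        // Map each character back to the single byte it was misread from.
        $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

        // Pure ASCII (nothing changed) or bytes that are not valid UTF-8:
        // the text was not double-encoded, so stop.
        if ($candidate === $text || !mb_check_encoding($candidate, 'UTF-8')) {
            break;
        }
        // Reject lossy conversions, e.g. characters that do not exist in
        // Windows-1252 and were replaced by "?".
        if (mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') !== $text) {
            break;
        }
        $text = $candidate;
    }
    return $text;
}

// "FÃ©dÃ©ration", written with escapes so the file's own encoding does not matter.
$garbled = "F\u{00C3}\u{00A9}d\u{00C3}\u{00A9}ration";
echo undo_double_encoding($garbled), "\n"; // UTF-8 bytes of "Fédération"
```

Correctly encoded text passes through unchanged because its Windows-1252 form is not valid UTF-8. Legitimate text that happens to look like mojibake would be altered, so treat this as a heuristic.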

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

answered Oct 15 '22 21:10

Sebastián Grignoli

You first have to detect which encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML declaration (the <?xml ... ?> line at the top of the document). If that’s missing too, use UTF-8 as defined in the specification.

Edit: Here is what I would probably do:

I’d use cURL to send the request and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field, which holds the MIME type and (hopefully) the charset parameter with the encoding. If not, we’ll check the XML declaration for an encoding attribute and take the encoding from there. If that’s also missing, the XML specification says to use UTF-8 as the encoding.

    $url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

    $accept = array(
        'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
        'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
    );
    $header = array(
        'Accept: '.implode(', ', $accept['type']),
        'Accept-Charset: '.implode(', ', $accept['charset']),
    );
    $encoding = null;
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    $response = curl_exec($curl);
    if (!$response) {
        // error fetching the response
    } else {
        // Split the raw response into header and body at the first blank line.
        $offset = strpos($response, "\r\n\r\n");
        $header = substr($response, 0, $offset);
        $body = substr($response, $offset + 4);
        if (!$header || !preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
            // error parsing the response
        } else {
            if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
                // type not accepted
            }
            // $match[2] is only set when a charset parameter is present;
            // strip quotes and the trailing "\r" left over from the header line.
            $encoding = isset($match[2]) ? strtolower(trim($match[2], "\"' \t\r\n")) : null;
        }
        if (!$encoding) {
            // No charset in the HTTP header: fall back to the XML declaration.
            if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
                $encoding = strtolower(trim($match[1], '"\''));
            }
        }
        if (!$encoding) {
            $encoding = 'utf-8';
        } else {
            if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
                // encoding not accepted
            }
            if ($encoding != 'utf-8') {
                // Note: the XML declaration in $body still names the original
                // encoding and may need to be rewritten before parsing.
                $body = mb_convert_encoding($body, 'utf-8', $encoding);
            }
        }
        $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
        if (!$simpleXML) {
            // parse error
        } else {
            echo $simpleXML->asXML();
        }
    }

answered Oct 15 '22 21:10

Gumbo
