
Ensure a string is UTF-8 encoded

Tags: php, csv, utf-8

In my application I read a CSV file and display its contents to the user. But there is a problem with encoding.

I have two CSV files, example1.csv and example2.csv. I opened both in Notepad++, which shows ANSI encoding for example1 and UTF-8 without BOM for example2.

First, I tried the mb_detect_encoding function to detect the encoding, but it reports UTF-8 in both cases, which is not correct.

Second, I tried to convert the file content to UTF-8 using utf8_encode. That works for the ANSI file. But for the UTF-8 without BOM file it seems that the content was encoded a second time: it displays à instead of the German ß. The same happens for other special characters.

I want to ensure that contents are always in UTF-8 format before displaying or processing them. Is there anything I am doing wrong?


This is how I use the mb_detect_encoding function:

$file_content = file_get_contents($_FILES['file']['tmp_name']);

die(var_dump( mb_detect_encoding($file_content))); 

and it prints UTF-8 for both examples.

UpCat asked Mar 02 '13 16:03


1 Answer

Intro: another inconvenient truth

It is impossible to detect the encoding of unknown text with 100% accuracy and/or confidence.

In practice there will be cases all over the spectrum of possible outcomes: you can be pretty sure that multilingual text in UTF-8 will be correctly detected as such, while it is flat out impossible to detect which of the family of ISO-8859 encodings corresponds to some text -- and unless you are willing to do statistical analysis, it is not even possible to make an educated guess!

What do we have to work with?

With that out of the way, let's see what you can do. First of all, unless you are bringing custom tools into the fight you are limited by what mb_detect_encoding can do for you. Unfortunately, that's not a whole lot. The documentation of the sister function mb_detect_order states:

mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.

UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP.

For ISO-8859-X, mbstring always detects as ISO-8859-X.

For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.

So, discounting the Japanese encodings, you basically have the capability to distinguish between UTF-8, UTF-7 and ASCII. You cannot detect ISO-8859-X because any text will be "recognized" as any of those encodings if you put it into consideration (i.e. you will have a 100% false positive rate -- not good), and the group which includes UTF-16 is simply not supported.
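
To make the false-positive point concrete, here is a small sketch (my own illustration, with arbitrary variable names). It shows that the UTF-8 bytes for ß also pass as ISO-8859-1, and that converting them "from ISO-8859-1" encodes them a second time, which looks like the symptom described in the question. It relies only on mb_check_encoding and mb_convert_encoding.

```php
<?php
// "ß" encoded as UTF-8 is the two bytes C3 9F.
$utf8_eszett = "\xC3\x9F";

// Every byte value is a legal ISO-8859-1 character, so this check
// succeeds even though the data is really UTF-8.
$looks_like_latin1 = mb_check_encoding($utf8_eszett, 'ISO-8859-1');

// Treating the UTF-8 bytes as ISO-8859-1 and converting "to UTF-8"
// encodes each byte separately: 2 bytes become 4 (double encoding).
$double_encoded = mb_convert_encoding($utf8_eszett, 'UTF-8', 'ISO-8859-1');

var_dump($looks_like_latin1);
var_dump(bin2hex($double_encoded));
```

So a positive answer for ISO-8859-X tells you nothing, and blindly converting from it can corrupt text that was already UTF-8.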

Unfortunately, the bad news doesn't end there. The order of the encodings matters too! Since text encoded in UTF-7 or ASCII is also valid UTF-8, placing UTF-8 at the front of the candidate list will ensure that's the only result you are ever going to get -- so it has to be avoided at all costs.

Since the default detection order is dependent on a php.ini setting, you should definitely not rely on that and move into a known state by setting your own detection order:

mb_detect_order('ASCII, UTF-8'); // I left UTF-7 out, but who cares?

So you can at least tell if your text is ASCII or UTF-8, right? Well, no. Not unless you specifically request that when you say "UTF-8", you really mean it:

$valid_utf8 = "\xC2\xA2";
$invalid_utf8 = "\xC2\x00";

mb_detect_order('UTF-8');
echo mb_detect_encoding($valid_utf8);   // "UTF-8": correct
echo mb_detect_encoding($invalid_utf8); // "UTF-8": WTF?!?!?!

The problem above is that unless you pass true for the $strict parameter, detection of UTF-8 is... a little over-optimistic.

Well, what can you actually do with this thing?

This is as good as it gets -- the correct way to detect encodings (just barely managing to keep using plural here):

$valid_utf8 = "\xC2\xA2";
$invalid_utf8 = "\xC2\x00";
$ascii = "hello world";

mb_detect_order('ASCII, UTF-8');
echo mb_detect_encoding($valid_utf8, mb_detect_order(), true);   // OK: "UTF-8"
echo mb_detect_encoding($invalid_utf8, mb_detect_order(), true); // OK: false
echo mb_detect_encoding($ascii, mb_detect_order(), true);        // OK: "ASCII"
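
Applied to the original question, strict validation suggests a simple pattern: keep input that is already valid UTF-8, and convert everything else from one fallback encoding that you choose. The sketch below is one way to do that, not the only one. The function name ensure_utf8 is hypothetical, and the Windows-1252 default is an assumption (it is what Notepad++ usually means by "ANSI" on Western European systems); if your files come from elsewhere, pick a different fallback. Stripping the BOM is an optional extra.

```php
<?php
/**
 * Return $text as UTF-8 (hypothetical helper, not part of mbstring).
 * $fallback is an ASSUMPTION about the source of non-UTF-8 input.
 */
function ensure_utf8(string $text, string $fallback = 'Windows-1252'): string
{
    // Optional: drop a UTF-8 BOM so it does not end up in the output.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }

    // Valid UTF-8 (which includes plain ASCII) is returned untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Anything else is assumed to be in the fallback encoding.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$from_ansi = ensure_utf8("Stra\xDFe");     // "Straße" in Windows-1252
$from_utf8 = ensure_utf8("Stra\xC3\x9Fe"); // "Straße" already in UTF-8
var_dump($from_ansi === $from_utf8);
```

The design choice here is to test for UTF-8 first, because random non-UTF-8 text rarely happens to be valid UTF-8, while the reverse check (is it valid Windows-1252 or ISO-8859-X?) almost always succeeds and proves nothing.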

What can be done with text that isn't valid UTF-8?

Unless you have out-of-band information about that text, unfortunately nothing.

OK, that's not entirely true. There are a few things that you can do in practice:

  1. See if there's a BOM at the beginning of the text. Probably there won't be, and even if there is, you might in principle be mistaking a single-byte encoding for Unicode, but it's worth a shot.
  2. See if it's a flavor of UTF-16. If a big majority of the even-numbered bytes (counting the first byte as 1) have the same value, then you're likely looking at UTF-16 LE. If this happens for a majority of the odd-numbered bytes, you're likely looking at UTF-16 BE. Unfortunately, in both cases you can never be sure.
  3. Assume that the text is in ISO-8859-X and do statistical analysis based on known properties of the script that corresponds to this encoding to see if the result is close to what you would expect. If it's close enough for some encodings in this class and way off for the others you can make an educated guess.
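
The first two checks in the list above can be sketched roughly as follows. Both function names (sniff_bom, guess_utf16) and the 0.8 threshold are my own assumptions for illustration; treat the results as hints, never as proof. The statistical analysis from the third item is not attempted here.

```php
<?php
// Return the encoding announced by a byte order mark, or null if none.
function sniff_bom(string $bytes): ?string
{
    // 4-byte marks come first: the UTF-32LE BOM starts with the UTF-16LE one.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

// Guess the UTF-16 byte order from how repetitive each half of a byte pair is.
// $threshold (an arbitrary choice) is the share one value must reach.
function guess_utf16(string $bytes, float $threshold = 0.8): ?string
{
    $length = strlen($bytes);
    if ($length < 2 || $length % 2 !== 0) {
        return null; // UTF-16 text always has an even number of bytes
    }

    $first = [];  // histogram of the 1st byte of each pair
    $second = []; // histogram of the 2nd byte of each pair
    for ($i = 0; $i < $length; $i += 2) {
        $a = ord($bytes[$i]);
        $b = ord($bytes[$i + 1]);
        $first[$a] = ($first[$a] ?? 0) + 1;
        $second[$b] = ($second[$b] ?? 0) + 1;
    }

    $pairs = $length / 2;
    $first_share = max($first) / $pairs;   // share of the most common value
    $second_share = max($second) / $pairs;

    // A constant 2nd byte means the high byte comes last: little endian.
    if ($second_share >= $threshold && $second_share > $first_share) {
        return 'UTF-16LE';
    }
    // A constant 1st byte means the high byte comes first: big endian.
    if ($first_share >= $threshold && $first_share > $second_share) {
        return 'UTF-16BE';
    }
    return null;
}

$le = mb_convert_encoding('hello world', 'UTF-16LE', 'UTF-8');
var_dump(guess_utf16($le));
var_dump(sniff_bom("\xEF\xBB\xBFabc"));
```

Once either function returns an encoding, mb_convert_encoding($bytes, 'UTF-8', $encoding) can do the actual conversion (after removing the BOM bytes, if there were any).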
Jon answered Sep 22 '22 12:09