
mb_detect_encoding detects ASCII as UTF-8?

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.

Currently it looks like this:

$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));

However, when mb_detect_encoding() is given an "extended ASCII" string (one containing Latin-1 special characters in the 192-255 byte range), it detects it as UTF-8, so the subsequent conversion to proper UTF-8 strips all special characters.
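To make the byte-level situation concrete, here is a minimal sketch using mb_check_encoding(), which validates a string against a single named encoding instead of guessing. The sample string is an assumption chosen for illustration ("café" with the é stored as the single Latin-1 byte 0xE9); it is not taken from real IPTC data.

```php
<?php
// "café" encoded as Latin-1: the final byte 0xE9 lies outside the 7-bit ASCII range.
$latin1 = "caf\xE9";

// 0xE9 is greater than 0x7F, so this is not valid ASCII.
$isAscii = mb_check_encoding($latin1, 'ASCII');

// 0xE9 is a UTF-8 lead byte for a three-byte sequence that never completes here.
$isUtf8 = mb_check_encoding($latin1, 'UTF-8');

// Every byte value maps to a character in ISO-8859-1, so this check passes.
$isLatin1 = mb_check_encoding($latin1, 'ISO-8859-1');

var_dump($isAscii, $isUtf8, $isLatin1); // bool(false), bool(false), bool(true)
```

In other words, such a string is neither ASCII nor UTF-8 in the strict sense, which is why the source encoding has to be identified (or assumed) before converting.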

I tried writing my own method that looks for Latin-1 byte values and, if none occurred, lets mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.

So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?

Cobra_Fast asked Apr 30 '13 11:04



1 Answer

Specifying a custom order, where ASCII is detected first, works.

mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
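Plugged back into the conversion from the question, this could look roughly like the sketch below. The function name iptc_to_utf8 is hypothetical, and passing true as the third (strict) argument is an addition that is not part of the call above; it is used here because non-strict detection has behaved differently across PHP versions. Note also that single-byte encodings cannot be told apart from the bytes alone, so listing ISO-8859-15 is really an assumption about where the data comes from.

```php
<?php
/**
 * Hypothetical helper: convert an imported metadata value to UTF-8.
 * Assumes that anything which is neither ASCII nor valid UTF-8 is ISO-8859-15.
 */
function iptc_to_utf8(string $val): string
{
    // Detection order: plain ASCII first, then UTF-8, then the assumed legacy encoding.
    $order = ['ASCII', 'UTF-8', 'ISO-8859-15'];

    // Strict mode rejects candidates for which the string contains invalid byte sequences.
    $detected = mb_detect_encoding($val, $order, true);

    if ($detected === false) {
        // No candidate matched; return the value unchanged rather than guess.
        return $val;
    }

    return mb_convert_encoding($val, 'UTF-8', $detected);
}

// Latin-1 style input: the single byte 0xE9 becomes the two-byte UTF-8 sequence C3 A9.
echo bin2hex(iptc_to_utf8("caf\xE9")), "\n"; // 636166c3a9
```

Pure ASCII input passes through unchanged, since ASCII text is already valid UTF-8.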

For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php

Cobra_Fast answered Sep 28 '22 08:09