
Strange behaviour of mb_detect_order() in PHP

Tags: php, encoding

I would like to detect the encoding of some text using PHP. For that purpose I use the mb_detect_encoding() function.

The problem is that the function returns different results if I change the order of candidate encodings with mb_detect_order().

Consider the following example:

$str = <<<STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;
// Candidate encodings are tried in this order; UTF-8 comes first here.
mb_detect_order(array('UTF-8','EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($str);
die($originalEncoding); // $originalEncoding = 'UTF-8'

However, if you change the order of encodings in mb_detect_order(), the result is different:

// Same text as above, but EUC-JP is now listed before UTF-8.
mb_detect_order(array('EUC-JP','UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($str);
die($originalEncoding); // $originalEncoding = 'EUC-JP'



So my questions are:
Why is that happening?
Is there a way in PHP to correctly and unambiguously detect the encoding of text?

asked May 21 '10 at 10:05 by Termos




1 Answer

That's what I would expect to happen.

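The detection algorithm probably just tries the encodings you specified in mb_detect_order(), in order, and returns the first one under which the byte stream would be valid. The same byte sequence can be valid in several encodings at once, so whichever of those comes first in the list wins.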
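To make the "first valid encoding wins" idea concrete, here is a minimal sketch. It is not mbstring's actual implementation: firstValidEncoding() is a hypothetical helper written for illustration, and it only relies on mb_check_encoding() to test whether the bytes are valid in a given encoding.

```php
<?php
// Hypothetical helper (not part of mbstring): returns the first candidate
// encoding under which $bytes is a valid byte sequence, or null if none is.
function firstValidEncoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

// Plain ASCII is valid in both encodings, so the order alone decides the result.
echo firstValidEncoding('hello', ['UTF-8', 'ISO-8859-1']), "\n";
echo firstValidEncoding('hello', ['ISO-8859-1', 'UTF-8']), "\n";

// A lone 0xE9 byte is not valid UTF-8, so the loop falls through to ISO-8859-1.
echo firstValidEncoding("\xE9", ['UTF-8', 'ISO-8859-1']), "\n";
```

With a scheme like this, swapping two candidates changes the answer whenever the input happens to be valid in both, which matches the behaviour described in the question.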

Something more intelligent requires statistical methods (I think machine learning is commonly used).

EDIT: See e.g. this article for more intelligent methods.

Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods, we aimed at a simple algorithm which can be uniformly applied to every charset, and the algorithm is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).
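As a rough illustration of what "more intelligent" could mean, the sketch below scores each candidate instead of stopping at the first valid one. It is a toy heuristic, not the Naive Bayes or SVM approach from the quoted article: it assumes the text is expected to be Japanese, and the names japaneseScore() and guessEncoding() are hypothetical. Each candidate that passes mb_check_encoding() is decoded to UTF-8, and the candidate whose decoded text contains the largest share of kana and kanji wins.

```php
<?php
// Hypothetical scorer: fraction of characters that are hiragana, katakana or kanji.
function japaneseScore(string $utf8): float
{
    $length = mb_strlen($utf8, 'UTF-8');
    if ($length === 0) {
        return 0.0;
    }
    $hits = preg_match_all('/[\p{Hiragana}\p{Katakana}\p{Han}]/u', $utf8);
    return $hits === false ? 0.0 : $hits / $length;
}

// Hypothetical detector: decode under every valid candidate, keep the best score.
// On a tie, the candidate listed first is kept.
function guessEncoding(string $bytes, array $candidates): ?string
{
    $best = null;
    $bestScore = -1.0;
    foreach ($candidates as $encoding) {
        if (!mb_check_encoding($bytes, $encoding)) {
            continue; // bytes are not even valid in this encoding
        }
        $decoded = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        $score = japaneseScore($decoded);
        if ($score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best;
}

// "こんにちは" written as explicit UTF-8 bytes, so the file's own encoding does not matter.
$utf8Bytes = "\xE3\x81\x93\xE3\x82\x93\xE3\x81\xAB\xE3\x81\xA1\xE3\x81\xAF";

// ISO-8859-1 accepts any byte sequence, but it decodes these bytes to non-Japanese
// characters and scores 0, so UTF-8 is chosen even though it is listed second.
echo guessEncoding($utf8Bytes, ['ISO-8859-1', 'UTF-8']), "\n";
```

A real detector would need per-language statistics rather than a single hard-coded script check, and this sketch says nothing about how well it separates closely related Japanese encodings.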

answered Sep 23 '22 at 02:09 by Artefacto