Strange behaviour of mb_detect_order() in PHP

Tags: php, encoding
I would like to detect the encoding of some text using PHP. For that purpose I use the mb_detect_encoding() function.

The problem is that the function returns different results if I change the order of the candidate encodings with mb_detect_order().

Consider the following example:

$str = <<<STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;
mb_detect_order(array('UTF-8','EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($str);
die($originalEncoding); // $originalEncoding = 'UTF-8'

However, if you change the order of the encodings in mb_detect_order(), the result is different:

mb_detect_order(array('EUC-JP', 'UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP', 'ISO-8859-1', 'ISO-8859-2'));
$originalEncoding = mb_detect_encoding($str);
die($originalEncoding); // $originalEncoding = 'EUC-JP'

So my questions are:
Why is that happening?
Is there a way in PHP to correctly and unambiguously detect the encoding of text?

asked May 21 '10 10:05

Termos

1 Answer

That's what I would expect to happen.

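The detection algorithm probably just tries the encodings you specified in mb_detect_order() one after another and returns the first one under which the byte stream would be valid. The same bytes can be valid under more than one encoding, so whichever matching candidate comes first in the list wins.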
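To make that concrete, here is a minimal sketch of a "first valid candidate wins" strategy. It is only an illustration of the idea, not mbstring's actual implementation: firstValidEncoding() is a hypothetical helper, built on the real mb_check_encoding() function, and the two-byte input is just an example chosen because it is valid under both candidates.

```php
<?php
// Hypothetical helper (not part of mbstring): returns the first candidate
// encoding under which $bytes is a valid byte sequence, or null if none is.
function firstValidEncoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding; // the first match wins, later candidates are never tried
        }
    }
    return null;
}

// 0xC3 0xA9 is "é" in UTF-8, but it is also two valid ISO-8859-1 characters.
$bytes = "\xC3\xA9";

echo firstValidEncoding($bytes, ['UTF-8', 'ISO-8859-1']), "\n"; // UTF-8
echo firstValidEncoding($bytes, ['ISO-8859-1', 'UTF-8']), "\n"; // ISO-8859-1
```

With a strategy like this, the order of the list acts as a priority: put the encodings that reject the most input (such as UTF-8) first and the permissive single-byte ones last.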

Something more intelligent requires statistical methods (I think machine learning is commonly used).

EDIT: See e.g. this article for more intelligent methods.

Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods, we aimed at a simple algorithm which can be uniformly applied to every charset, and the algorithm is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).
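As a much cruder stand-in for the statistical approaches mentioned above (this is not Naive Bayes or an SVM, and not the article's method), one could score each candidate by how plausible the decoded text looks. The sketch below assumes the text is expected to be Japanese; guessJapaneseEncoding() is a hypothetical helper, and the scoring rule (share of hiragana, katakana and kanji characters) is my own assumption.

```php
<?php
// Hypothetical helper: among the candidates under which $bytes is valid,
// pick the one whose decoded text contains the largest share of Japanese script.
function guessJapaneseEncoding(string $bytes, array $candidates): ?string
{
    $best = null;
    $bestScore = -1.0;
    foreach ($candidates as $encoding) {
        // Skip encodings under which the byte stream is not even valid.
        if (!mb_check_encoding($bytes, $encoding)) {
            continue;
        }
        $utf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        $total = mb_strlen($utf8, 'UTF-8');
        if ($total === 0) {
            continue;
        }
        // Count hiragana, katakana and kanji code points in the decoded text.
        $japanese = (int) preg_match_all('/[\p{Hiragana}\p{Katakana}\p{Han}]/u', $utf8);
        $score = $japanese / $total;
        if ($score > $bestScore) { // on a tie, the earlier candidate is kept
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best;
}

$text = 'ちょっとのアクセスで落ちてしまったり';
echo guessJapaneseEncoding($text, ['ISO-8859-1', 'UTF-8']), "\n"; // UTF-8
```

Here the order of the candidates only matters for ties. The heuristic is still weak for telling similar multi-byte encodings apart, so treat it as a starting point rather than a reliable detector.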

answered Sep 23 '22 02:09

Artefacto
