
How to read a Unicode text file in PHP?

I am having trouble reading a text file (saved as UTF-16LE) in my PHP script.

My PHP script itself is saved (for some reason) in UTF-8.

Here is my code:

$lines = file("./somedir/$filename");

for ($i=0; $i < count($lines); $i++) {
    $lines[$i] = iconv("Unicode", "UTF-8", $lines[$i]); // converting to UTF8
}

echo "[0]:".$lines[0]; // outputs CORRECT text (like "This is the first line")
echo "[1]:".$lines[1]; // outputs something like çæ¤ææ¬çææ¸ææ°ã

Any idea please? I checked value of count($lines) and it's perfectly correct... Thanks.

EDIT:
OK so I tried iconv("UTF-16", "UTF-8", $lines[$i]);
I also tried iconv("UTF-16LE", "UTF-8", $lines[$i]);
But still no success...

Enriqe asked Feb 26 '13 15:02


2 Answers

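PHP's file function cannot correctly read files in the UTF-16LE encoding. It needs to split on the line-ending character, but PHP only supports single-byte sequences here. UTF-16LE is a multibyte, variable-length encoding that is incompatible with the line-splitting procedure built into the file function: file cuts after every 0x0A byte, whereas a UTF-16LE line feed is the two bytes 0x0A 0x00, so the trailing 0x00 ends up at the start of the next line and shifts everything after it by one byte.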

So you are using the wrong function for the job; the answer is that simple. The problem here is not iconv but the use of file.
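To see the misalignment concretely, here is a small sketch. The sample text and the temporary file are made up for illustration and are not taken from the question; the point is only to show which bytes file() puts into each array element.

```php
<?php
// Sample data (an assumption for this demo): two lines encoded as UTF-16LE.
$path = tempnam(sys_get_temp_dir(), 'u16');
file_put_contents($path, iconv('UTF-8', 'UTF-16LE', "first\nsecond\n"));

// file() cuts the raw bytes after every 0x0A byte.
$lines = file($path);
unlink($path);

// Element 0 ends with 0x0A but lacks the 0x00 that completes the line feed.
// Element 1 starts with that stray 0x00, so its bytes no longer pair up
// into the 16-bit units the text was written in.
foreach ($lines as $i => $chunk) {
    printf("[%d] %d bytes: %s\n", $i, strlen($chunk), bin2hex($chunk));
}
```

The hex dump makes it visible that only the first element starts on a 16-bit boundary, which is why converting the elements one by one cannot work.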

Instead, you need to read the file into a buffer, take one line after another out of the buffer, and then re-encode each line to UTF-8.

That starts with finding out which line separator the file uses. Because PHP's file functions, its string functions, and strings themselves are binary-based, you can write the separator as a binary sequence in a string and locate it with the strpos function.
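As a rough sketch of that lookup, assuming the separator is a plain line feed (bytes 0x0A 0x00 in UTF-16LE) and that the buffer starts on a 16-bit boundary: the helper name find_utf16le_newline is hypothetical, not a PHP built-in. The even-offset check is there because the same two bytes can also show up straddling two neighbouring characters.

```php
<?php
// Hypothetical helper: return the byte offset of the next UTF-16LE line feed
// in $buffer at or after $offset, or false if there is none.
function find_utf16le_newline(string $buffer, int $offset = 0)
{
    $separator = "\n\0"; // U+000A encoded as UTF-16LE
    while (($pos = strpos($buffer, $separator, $offset)) !== false) {
        if ($pos % 2 === 0) {
            return $pos; // aligned on a 16-bit boundary: a real line feed
        }
        // Odd offset: the bytes belong to two different characters, keep looking.
        $offset = $pos + 1;
    }
    return false;
}
```

For a file with Windows line endings the line feed is preceded by a carriage return (0x0D 0x00), which can be stripped after conversion.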

Then split the buffer line by line (refilling it from the file whenever it runs out of bytes) and convert each line with iconv as outlined in the manual page. The example code in your question does not look wrong on that point; just take care to pass the right parameters so the encodings are correct.
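Put together, the approach could look roughly like the sketch below. It assumes the file is UTF-16LE with line-feed (optionally CR+LF) line endings and an optional byte order mark; the function name read_utf16le_lines and the chunk size are illustrative choices, not anything prescribed by PHP.

```php
<?php
// Hypothetical helper: yield every line of a UTF-16LE file as a UTF-8 string.
function read_utf16le_lines(string $path, int $chunkSize = 8192): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        $buffer = '';
        $atStart = true;
        do {
            // Refill the buffer from the file.
            $chunk = fread($handle, $chunkSize);
            $eof = ($chunk === false || $chunk === '');
            if (!$eof) {
                $buffer .= $chunk;
            }
            // Drop a little-endian byte order mark at the very start, if any.
            if ($atStart && strlen($buffer) >= 2) {
                if (substr($buffer, 0, 2) === "\xFF\xFE") {
                    $buffer = substr($buffer, 2);
                }
                $atStart = false;
            }
            // Cut complete lines out of the buffer at aligned 0x0A 0x00 pairs.
            $search = 0;
            while (($pos = strpos($buffer, "\n\0", $search)) !== false) {
                if ($pos % 2 !== 0) {
                    $search = $pos + 1; // not on a 16-bit boundary, skip
                    continue;
                }
                yield rtrim(iconv('UTF-16LE', 'UTF-8', substr($buffer, 0, $pos)), "\r");
                $buffer = substr($buffer, $pos + 2);
                $search = 0;
            }
        } while (!$eof);
        // Whatever is left is a final line without a trailing line feed.
        if ($buffer !== '') {
            yield rtrim(iconv('UTF-16LE', 'UTF-8', $buffer), "\r");
        }
    } finally {
        fclose($handle);
    }
}
```

It would be used like `foreach (read_utf16le_lines("./somedir/$filename") as $line) { ... }`. Splitting before converting keeps memory use bounded by the chunk size plus the longest line; for small files, converting the whole content once and then splitting the UTF-8 result is a simpler alternative.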

hakre answered Sep 30 '22 14:09


The following code works for me. Just use the function fopen_utf8 below instead of fopen.

<?php
# http://www.practicalweb.co.uk/blog/2008/05/18/reading-a-unicode-excel-file-in-php/
// Open a file for reading and attach a filter that converts its contents to UTF-8.
function fopen_utf8($filename){
    $encoding = '';
    $handle = fopen($filename, 'r');
    if ($handle === false) {
        return false; // same failure value as fopen()
    }
    $bom = fread($handle, 2);
    rewind($handle);

    if ($bom === chr(0xff).chr(0xfe) || $bom === chr(0xfe).chr(0xff)) {
        // UTF-16 byte order mark present (little- or big-endian)
        $encoding = 'UTF-16';
    } else {
        // Read the first 1000 bytes as a sample for detection;
        // appending 'e' is a workaround for an mb_string bug.
        $file_sample = fread($handle, 1000) . 'e';
        rewind($handle);

        $encoding = mb_detect_encoding($file_sample, 'UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP');
    }
    if ($encoding) {
        // From here on, reads from $handle return UTF-8.
        stream_filter_append($handle, 'convert.iconv.'.$encoding.'/UTF-8');
    }
    return $handle;
}
?>

Source: http://www.practicalweb.co.uk/blog/2008/05/18/reading-a-unicode-excel-file-in-php/
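For reference, the conversion-filter idea can be tried in isolation with a sketch like the one below. It assumes the encoding is already known to be UTF-16LE and that the file has no byte order mark; the temporary file and its contents are made up for the demonstration.

```php
<?php
// Sample UTF-16LE file without a BOM, created only for this demonstration.
$path = tempnam(sys_get_temp_dir(), 'u16');
file_put_contents($path, iconv('UTF-8', 'UTF-16LE', "first line\nsecond line\n"));

$handle = fopen($path, 'rb');
// Convert on the fly: with the filter attached, fgets() sees UTF-8 bytes
// and can split on single-byte line feeds as usual.
stream_filter_append($handle, 'convert.iconv.UTF-16LE/UTF-8', STREAM_FILTER_READ);

$lines = [];
while (($line = fgets($handle)) !== false) {
    $lines[] = rtrim($line, "\r\n");
}
fclose($handle);
unlink($path);

print_r($lines);
```

If the file does start with a byte order mark, converting with an explicit UTF-16LE source leaves it in the output as U+FEFF at the beginning of the first line, where it can be stripped.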

Dubbo answered Sep 30 '22 16:09