Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to iterate over non-English file names in PHP

I have a directory which contains several files, many of which has non-english name. I am using PHP in Windows 7.

I want to list the filename and their content using PHP.

Currently I am using DirectoryIterator and file_get_contents. This works for English files names but not for non-English (chinese) file names.

For example, I have filenames like "एक और प्रोब्लेम.eml", "hello 鶨鶖鵨鶣鎹鎣.eml".

  1. DirectoryIterator is not able to get the filename using ->getFilename()
  2. file_get_contents is also not able to open even if I hard code the filename in its parameter.

How can I do it?

like image 678
Sabya Avatar asked Jun 01 '10 07:06

Sabya


2 Answers

This is not possible. It's a limitation of PHP. PHP uses the multibyte versions of Windows APIs; you're limited to the characters your codepage can represent.

See this answer.

Directory contents:

D:\Users\Cataphract\Desktop\teste2>dir
 Volume in drive D is GRANDEDISCO
 Volume Serial Number is 945F-DB89

 Directory of D:\Users\Cataphract\Desktop\teste2

01-06-2010  17:16              .
01-06-2010  17:16              ..
01-06-2010  17:15                 0 coptic small letter shima follows ϭ.txt
01-06-2010  17:18                86 teste.php
               2 File(s)             86 bytes
               2 Dir(s)  12.178.505.728 bytes free

Test file contents:

<?php
exec('pause');
foreach (new DirectoryIterator(".") as $v) {
    echo $v."\n";
}

Test file results:

.
..
coptic small letter shima follows ?.txt
teste.php

Debugger output:

Call stack (PHP 5.3.0):

>   php5ts_debug.dll!readdir_r(DIR * dp=0x02f94068, dirent * entry=0x00a7e7cc, dirent * * result=0x00a7e7c0)  Line 80   C
    php5ts_debug.dll!php_plain_files_dirstream_read(_php_stream * stream=0x02b94280, char * buf=0x02b9437c, unsigned int count=260, void * * * tsrm_ls=0x028a15c0)  Line 820 + 0x17 bytes   C
    php5ts_debug.dll!_php_stream_read(_php_stream * stream=0x02b94280, char * buf=0x02b9437c, unsigned int size=260, void * * * tsrm_ls=0x028a15c0)  Line 603 + 0x1c bytes  C
    php5ts_debug.dll!_php_stream_readdir(_php_stream * dirstream=0x02b94280, _php_stream_dirent * ent=0x02b9437c, void * * * tsrm_ls=0x028a15c0)  Line 1806 + 0x16 bytes    C
    php5ts_debug.dll!spl_filesystem_dir_read(_spl_filesystem_object * intern=0x02b94340, void * * * tsrm_ls=0x028a15c0)  Line 199 + 0x20 bytes  C
    php5ts_debug.dll!spl_filesystem_dir_open(_spl_filesystem_object * intern=0x02b94340, char * path=0x02b957f0, void * * * tsrm_ls=0x028a15c0)  Line 238 + 0xd bytes   C
    php5ts_debug.dll!spl_filesystem_object_construct(int ht=1, _zval_struct * return_value=0x02b91f88, _zval_struct * * return_value_ptr=0x00000000, _zval_struct * this_ptr=0x02b92028, int return_value_used=0, void * * * tsrm_ls=0x028a15c0, long ctor_flags=0)  Line 645 + 0x11 bytes  C
    php5ts_debug.dll!zim_spl_DirectoryIterator___construct(int ht=1, _zval_struct * return_value=0x02b91f88, _zval_struct * * return_value_ptr=0x00000000, _zval_struct * this_ptr=0x02b92028, int return_value_used=0, void * * * tsrm_ls=0x028a15c0)  Line 658 + 0x1f bytes   C
    php5ts_debug.dll!zend_do_fcall_common_helper_SPEC(_zend_execute_data * execute_data=0x02bc0098, void * * * tsrm_ls=0x028a15c0)  Line 313 + 0x78 bytes   C
    php5ts_debug.dll!ZEND_DO_FCALL_BY_NAME_SPEC_HANDLER(_zend_execute_data * execute_data=0x02bc0098, void * * * tsrm_ls=0x028a15c0)  Line 423  C
    php5ts_debug.dll!execute(_zend_op_array * op_array=0x02b93888, void * * * tsrm_ls=0x028a15c0)  Line 104 + 0x11 bytes    C
    php5ts_debug.dll!zend_execute_scripts(int type=8, void * * * tsrm_ls=0x028a15c0, _zval_struct * * retval=0x00000000, int file_count=3, ...)  Line 1188 + 0x21 bytes C
    php5ts_debug.dll!php_execute_script(_zend_file_handle * primary_file=0x00a7fad4, void * * * tsrm_ls=0x028a15c0)  Line 2196 + 0x1b bytes C
    php.exe!main(int argc=2, char * * argv=0x028a14c0)  Line 1188 + 0x13 bytes  C
    php.exe!__tmainCRTStartup()  Line 555 + 0x19 bytes  C
    php.exe!mainCRTStartup()  Line 371  C

Is it really a question mark?

dp->fileinfo
{dwFileAttributes=32 ftCreationTime={...} ftLastAccessTime={...} ...}
    dwFileAttributes: 32
    ftCreationTime: {dwLowDateTime=2784934701 dwHighDateTime=30081445 }
    ftLastAccessTime: {dwLowDateTime=2784934701 dwHighDateTime=30081445 }
    ftLastWriteTime: {dwLowDateTime=2784934701 dwHighDateTime=30081445 }
    nFileSizeHigh: 0
    nFileSizeLow: 0
    dwReserved0: 3435973836
    dwReserved1: 3435973836
    cFileName: 0x02f9409c "coptic small letter shima follows ?.txt"
    cAlternateFileName: 0x02f941a0 "COPTIC~1.TXT"
dp->fileinfo.cFileName[34]
63 '?'

Yes! It's character #63.

like image 73
Artefacto Avatar answered Sep 17 '22 13:09

Artefacto


Short reply:

Under Windows, you cannot access arbitrary file names with PHP; you are limited to those file names whose name can be represented with the currently selected "code page" (see Regional and Language Options", "Format" panel and "Administrative" tab panel "Language for non-Unicode programs").

Longer reply:

Windows uses UTF-16 for file encoding since Win2000, but PHP communicate with the underlying file system as a "non-Unicode aware program". This means that there is a current "code page table" that tranlates from PHP strings to UTF-16 strings and vice-versa. From PHP the current code page can be retrieved by setlocale() in the form "language_country.codepage", for example:

setlocale(LC_CTYPE, 0) ==> "english_United States.1252"

where 1252 is the Windows code page table currently selected from the control panel; file names retrieved from the file system are encoded using that code page; file names generated from PHP must be encoded according to that code page. Things are even more complicated by the fact that UTF-16 file names are traslated to PHP strings using the "best-fit code page", that is an approxymated representation of the actual characters/words, so you cannot trust on file names and paths retrieved from the file system as they might be arbitrarily mangled.

References:

http://en.wikipedia.org/wiki/Windows_code_page What "Windows code pages" are.

https://bugs.php.net/bug.php?id=47096 More details about this issue.

like image 33
Umberto Salsi Avatar answered Sep 21 '22 13:09

Umberto Salsi