Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PHP7 UTF-8 filenames on Windows server, new phenomenon caused by ZipArchive

Update:

Preparing a bug report to the great people that make PHP 7 possible I revised my research once more and tried to melt it down to a few simple lines of code. While doing this I found that PHP itself is not the cause of the problem. I will share my results here when I'm done. Just so you know and don't possibly waste your time or something :)


Synopsis: PHP7 now seems able to write UTF-8 filenames but is unable to access them?

Preamble: I read about 10-15 articles here touching the subject but they did not help me solve the problem and they all are older than the PHP7 release. It seems to me that this is probably a new issue and I wonder if it might be a bug. I spent a lot of time experimenting with en-/decoding of the strings and trying to figure out a way to make it work - to no avail.

Good day everybody and greetings from Germany (insert shy not-my-native-language-remark here), I hope you can help me out with this new phenomenon I encountered. It seems to be "new" in the sense that it came with PHP 7.

I think most people working with PHP on a Windows system are very familiar with the problem of filenames and the transparent wrapper of PHP that manages access to files that have non-ASCII filenames (or windows-1252 or whatever is the system code page).

I'm not quite sure how to approach the subject and as you can see I'm not very experienced in composing questions so please don't rip my head off instantly. And yes I will strive to keep it short. Here we go:

First symptom: after updating to PHP7 I sometimes encountered problems with accessing files generated by my software. Sometimes it worked as usual, sometimes not. I found out the difference was that PHP7 now seems able to write UTF-8 filenames but is unable to access files with those names.

After generating said files on two separate "identical" systems (differing only in the PHP version) this is how the files are named on the hard drive:

PHP 5.5: Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip

PHP 7: Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip

Splendid, PHP 7 is capable of writing unicode-filenames on the HDD, and UTF-16 is used on windows afaik. Now the downside is that when I try to access those files for example with is_file() PHP 5.5 works but PHP 7 does not.

Consider this code snippet (note: I "hacked" into this function because it was the simplest way, it was not written for this purpose). This function gets called after a zip-file gets generated taking on the name of the customer and other values to determine a proper name. Those come out of the database. Database and internal encoding of PHP are both UTF-8. clearstatcache is per se not necessary but I included it to make things clearer. Important: Everything that happens is done with PHP7, no other entity is responsible for creating the zip-file. To be precise it is done with class ZipArchive. Actually it does not even matter that it is a zip-archive, the point is that the filename and the content of the file are created by PHP7 - successfully.

public static function downloadFileAsStream( $file )
{
    clearstatcache();
    print $file . "<br/>";
    var_dump(is_file($file));
    die();
}       

Output is:

D:/htdocs/otm/.data/_tmp/Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip
bool(false) 

So PHP7 is able to generate the file - they indeed DO exist on the harddrive and are legit and accessible and all - but is incapable of accessing them. is_file is not the only function that fails, file_exists() does too for example.

A little experiment with encoding conversion to give you a taste of the things I tried:

public static function downloadFileAsStream( $file )
{
    clearstatcache();
    print $file . "<br/>";
    print mb_detect_encoding($file, 'ASCII,UTF-16,windows-1252,UTF-8', false) . "<br/>";
    print mb_detect_encoding($file, 'ASCII,UTF-16,windows-1252,UTF-8', true) . "<br/>";

    if (($detectedEncoding = mb_detect_encoding($file, 'ASCII,UTF-16,windows-1252,UTF-8', true)) != 'windows-1252')
    {
        $file = mb_convert_encoding($file, 'UTF-16', $detectedEncoding);
    }

    print $file . "<br/>";
    var_dump(is_file($file));
    die();
}       

Output is:

D:/htdocs/otm/.data/_tmp/Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip
UTF-8
UTF-8
D:/htdocs/otm/.data/_tmp/Lokaltest_KG_o"[W_lI[W_Kr�mhold-DEZ1604-140081-complete.zip
NULL 

So converting from UTF-8 (database/internal encoding) to UTF-16 (windows file system) does not seem to work either.

I am at the end of my rope here and sadly the issue is very important to us since we cannot update our systems with this problem looming in the background. I hope somebody can shed a little light on this. Sorry for the long post, I'm not sure how well I could get my point across.


Addition:

$file = utf8_decode($file);
var_dump(is_file($file));
die();

Delivers false for the filename with the japanese letters. When I change the input used to create the filename so that the filename now is Lokaltest_KG_Krümhold-DEZ1604-140081-complete.zip above code delivers true. So utf8_decode helps but only with a small part of unicode, german umlauts?

like image 870
thomasKberlin Avatar asked May 10 '16 12:05

thomasKberlin


1 Answers

Answering my own question here: The actual bad boy was the component ZipArchive which created files with incorrectly encoded filenames. I have written a hopefully helpful bug report: https://bugs.php.net/bug.php?id=72200

Consider this short script:

print "php default_charset: ".ini_get('default_charset')."\n"; // just 4 info (UTF-8)

$filename = "bugtest_müller-lüdenscheid.zip"; // just an example
$filename = utf8_encode($filename); // simulating my database delivering utf8-string

$zip = new ZipArchive();
if( $zip->open($filename, ZipArchive::CREATE | ZipArchive::OVERWRITE) === true )
{
    $zip->addFile('bugtest.php', 'bugtest.php'); // copy of script file itself
    $zip->close();
}

var_dump( is_file($filename) );  // delivers ?

output:

output PHP 5.5.35:
    php default_charset: UTF-8
    bool(true)

output PHP 7.0.6:
    php default_charset: UTF-8
    bool(false)
like image 63
thomasKberlin Avatar answered Nov 15 '22 01:11

thomasKberlin