
PHP scandir() and htmlentities(): issues with charset and/or special characters

I am using jqueryFileTree to show a directory listing on the server with download links to the files in the directory. Recently I've run into an issue with files which contain special characters:

  • test.pdf : works fine
  • tést.pdf : does not work (notice the é - acute accent - in the filename)

When debugging the PHP connector of jqueryFileTree, I see that it does a scandir() on the directory passed via $_GET and then loops over each file and directory in it. Before putting the file name into the URL, the script correctly runs htmlentities() over it. The problem is that this htmlentities($file) call just returns an empty string, which according to the PHP docs can happen when the input string contains an invalid code unit sequence within the given encoding. However, I tried passing the charset explicitly by calling:

$file = htmlentities($file,ENT_QUOTES,'UTF-8');

But this also returns an empty string.

If I call $file = htmlentities($file,ENT_IGNORE,'UTF-8'); instead, the é character is simply dropped (so tést.pdf becomes tst.pdf).
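For reference, both results can be reproduced in isolation. This is only a sketch: it assumes the é reaches PHP as the single byte 0xE9 (as it would in Windows-1252 or ISO-8859-1), which is a guess about what scandir() returns here, and it needs the mbstring extension for the validity check.

```php
<?php
// Assumption: "tést.pdf" arrives with é encoded as the lone byte 0xE9.
$file = "t\xE9st.pdf";

// 0xE9 followed by "s" is not a valid UTF-8 sequence.
$isValidUtf8 = mb_check_encoding($file, 'UTF-8');

// Invalid input without ENT_IGNORE or ENT_SUBSTITUTE: htmlentities() gives up
// and returns an empty string.
$strict = htmlentities($file, ENT_QUOTES, 'UTF-8');

// ENT_IGNORE silently discards the invalid byte, so the é disappears.
$ignored = htmlentities($file, ENT_IGNORE, 'UTF-8');

var_dump($isValidUtf8, $strict, $ignored);
```

If the assumption holds, the empty string is not a bug in htmlentities() but a sign that the file name is not UTF-8 in the first place.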

When debugging my PHP script with Xdebug, I can see that the source string contains an unknown character.

So I'm at my wits' end trying to find a solution for this. Any help would be welcome.

FYI:

  • The charset of my page is UTF-8 (specified in metadata)
  • The file is stored on a Windows 2003 file server and scandir() is executed with the UNC path (e.g. //fileserver/sharename/sourcedir)
  • The default encoding in my php.ini is set to UTF-8
  • The web server and PHP 5.4.26 are running on a Windows 2008 R2 server
Alex asked Mar 26 '14 12:03


1 Answer

My best guess is that the filename itself isn't using UTF-8. Or at least scandir() isn't picking it up like that.

Maybe mb_detect_encoding() can shed some light?

var_dump(mb_detect_encoding($filename));
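Detection can be unreliable for single-byte encodings, so it may help to look at the raw bytes as well. The sketch below uses two hard-coded sample strings (assumptions standing in for whatever scandir() actually returns) to show what to look for in the hex dump:

```php
<?php
// Assumed sample values; in practice $filename would come from scandir().
$singleByte = "t\xE9st.pdf";      // é as one byte (Windows-1252 / ISO-8859-1)
$utf8       = "t\xC3\xA9st.pdf";  // é as two bytes (UTF-8)

$hexSingleByte = bin2hex($singleByte); // "74e973742e706466"
$hexUtf8       = bin2hex($utf8);       // "74c3a973742e706466"

var_dump($hexSingleByte, $hexUtf8);
```

A lone e9 between the t and the s points to a single-byte encoding, while c3a9 means the name is already UTF-8.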

If not, try to guess the encoding (CP1252 or ISO-8859-1 would be my first guess) and convert it to UTF-8, see if the output is valid:

var_dump(mb_convert_encoding($filename, 'UTF-8', 'Windows-1252'));
var_dump(mb_convert_encoding($filename, 'UTF-8', 'ISO-8859-1'));
var_dump(mb_convert_encoding($filename, 'UTF-8', 'ISO-8859-15'));

Or using iconv():

var_dump(iconv('WINDOWS-1252', 'UTF-8', $filename));
var_dump(iconv('ISO-8859-1',   'UTF-8', $filename));
var_dump(iconv('ISO-8859-15',  'UTF-8', $filename));
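Note that for an é alone these three candidates are indistinguishable, because all of them map byte 0xE9 to the same character. A small sketch of the comparison, using 0xA4 as an example of a byte where they do differ (the variable names are only illustrative):

```php
<?php
$candidates = ['Windows-1252', 'ISO-8859-1', 'ISO-8859-15'];

$results = [];
foreach ($candidates as $encoding) {
    $results[$encoding] = [
        // é: the same UTF-8 bytes (c3a9) for all three encodings.
        'e_acute' => bin2hex(mb_convert_encoding("\xE9", 'UTF-8', $encoding)),
        // 0xA4: the currency sign (c2a4) in Windows-1252 and ISO-8859-1,
        // but the euro sign (e282ac) in ISO-8859-15.
        'byte_a4' => bin2hex(mb_convert_encoding("\xA4", 'UTF-8', $encoding)),
    ];
}

print_r($results);
```

So a file name containing only accented Latin letters will look correct with any of them; telling them apart needs a name with a character such as € in it.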

Then when you've figured out which encoding is actually used, your code should look somewhat like this (assuming CP1252):

$filename = htmlentities(mb_convert_encoding($filename, 'UTF-8', 'Windows-1252'), ENT_QUOTES, 'UTF-8');
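If some names might already be UTF-8, the conversion could be wrapped in a small helper that converts only when needed. This is a sketch, not part of jqueryFileTree: the function name escape_filename() and the Windows-1252 default are assumptions, and the validity check is a heuristic, since a few Windows-1252 byte sequences also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper; the name and the default source encoding are assumptions.
function escape_filename($filename, $sourceEncoding = 'Windows-1252')
{
    // Convert only when the bytes are not already valid UTF-8.
    if (!mb_check_encoding($filename, 'UTF-8')) {
        $filename = mb_convert_encoding($filename, 'UTF-8', $sourceEncoding);
    }

    // The string is now UTF-8, so htmlentities() no longer returns "".
    return htmlentities($filename, ENT_QUOTES, 'UTF-8');
}

$fromShare   = escape_filename("t\xE9st.pdf");     // single-byte input
$alreadyUtf8 = escape_filename("t\xC3\xA9st.pdf"); // UTF-8 input

var_dump($fromShare, $alreadyUtf8); // both "t&eacute;st.pdf"
```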
Jasper N. Brouwer answered Sep 21 '22 05:09