UTF8 Filenames in PHP and Different Unicode Encodings

I have a file whose name contains Unicode characters on a server running Linux. If I SSH into the server and use tab completion to navigate to the file or folder, I have no problem accessing it. The problem arises when I try to access the file via PHP (the function I was calling was stat). If I output the path generated by the PHP script to the browser and paste it into the terminal, the file also appears not to exist, even though the two paths look exactly the same in the terminal.

I set PHP to use UTF-8 as its default encoding via php.ini and also set mb_internal_encoding. I checked the encoding of the PHP file path string and it comes out as UTF-8, as it should. Poking around a bit more, I decided to hexdump the é character that the terminal's tab completion produces and compare it to the hexdump of the 'regular' é character created by the PHP script or entered manually via the keyboard (option+e, then e, on OS X). Here is the result:

echo -n é | hexdump
0000000 cc65 0081                              
0000003
echo -n é | hexdump
0000000 a9c3                                   
0000002

The é character that allows a correct file reference in the terminal is the 3-byte one. I'm not sure where to go from here. What encoding should I use in PHP? Should I be converting the path to another encoding via iconv or mb_convert_encoding?

asked Jul 07 '09 01:07

iloveitaly

2 Answers

Thanks to the tips given in the two answers, I was able to poke around and find some methods for normalizing the different Unicode decompositions of a given character. In my situation, I was accessing files created by an OS X Carbon application. It is a fairly popular application, and thus its file names seemed to adhere to a specific Unicode decomposition.

PHP 5.3 introduced a new set of functions (the intl extension's Normalizer class) that allow you to normalize a Unicode string to a particular form. Apparently there are four normalization forms (NFC, NFD, NFKC and NFKD) into which you can convert your Unicode string. Python has had Unicode normalization capabilities since version 2.3 via unicodedata.normalize. This article on Python's handling of Unicode strings was helpful in understanding encoding and string handling a bit better.
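
To make the four forms mentioned above a little more concrete, here is a minimal Python 3 sketch that normalizes one sample string each way and prints the resulting code points. The sample string (é followed by the "fi" ligature) and the helper name code_points are assumptions chosen for illustration; they are not part of the original problem.

```python
import unicodedata

# Illustrative sample (an assumption): precomposed é (U+00E9)
# followed by the "fi" ligature (U+FB01).
sample = "\u00e9\ufb01"

def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]

forms = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    forms[form] = unicodedata.normalize(form, sample)
    print(form, code_points(forms[form]))
```

The canonical forms (NFC, NFD) only compose or decompose the accented letter, while the compatibility forms (NFKC, NFKD) also replace the ligature with plain "f" and "i", which is usually not what you want for filenames.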

Here is a quick Python example of normalizing a Unicode file path:

import unicodedata
filePath = unicodedata.normalize('NFD', filePath)

I found that the NFD form worked for all my purposes. I wonder if this is the standard decomposition for Unicode filenames.
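
If it is not known in advance which form a given file name was stored in, one possible approach is to try several normalized variants of the path until one exists. The following is only a sketch under that assumption: resolve_unicode_path is a hypothetical helper, and the order in which forms are tried is an arbitrary choice. As far as I know, HFS+ on OS X stores names in a decomposed form close to NFD, while typical Linux filesystems store whatever bytes they are given, which would explain why NFD matched here.

```python
import os
import unicodedata

def resolve_unicode_path(path, forms=("NFD", "NFC")):
    """Return the first variant of path that exists on disk, else None.

    Hypothetical helper: the name and the default order of forms are
    assumptions made for this sketch.
    """
    # Try the path as given first, then each normalized variant.
    candidates = [path] + [unicodedata.normalize(f, path) for f in forms]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None
```

A call such as resolve_unicode_path("/some/dir/café.txt") would then return whichever spelling of the name actually exists, or None if neither does. Note that this normalizes every component of the path, not just the final file name.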

answered Sep 20 '22 06:09

iloveitaly

The three-byte sequence is actually the UTF-8 representation of an e (0x65) followed by a combining ´ (U+0301, encoded as 0xcc 0x81), while 0xc3 0xa9 stands "directly" for é (U+00E9).
A UTF-8-aware collation should be aware of the possible decompositions, but I don't know how you can enable that on a Mac (you would probably have to recompile the PHP source).
The best I can offer is the "Using UTF-8 with Gentoo" description.
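
The byte sequences described above can be reproduced with a short Python 3 sketch. The variable names are illustrative assumptions; the only library calls used are unicodedata.normalize, str.encode and bytes.hex.

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point (U+00E9)
decomposed = unicodedata.normalize("NFD", precomposed)  # e + U+0301

precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(precomposed_bytes.hex())  # two bytes
print(decomposed_bytes.hex())   # three bytes

# The two strings render identically but compare unequal until they
# are brought to the same normalization form.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The hexdump output in the question shows the same bytes in a different order ("cc65 0081" and "a9c3") because hexdump by default prints 16-bit words, which on a little-endian machine swaps each pair of bytes.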

answered Sep 20 '22 06:09

VolkerK

Tags: php, encoding, unicode, utf-8, filepath
