
PHP basename() and pathinfo() with multibyte UTF-8 file names

I've found that the PHP functions basename() and pathinfo() behave strangely with multibyte UTF-8 names: they strip all non-Latin characters up to the first Latin character or punctuation mark, but preserve any non-Latin characters that come after it.

basename("àxà"); // returns "xà", I would expect "àxà" or just "x" instead
pathinfo("àyà/àxà", PATHINFO_BASENAME); // returns "xà", same as above

Curiously, though, the dirname part of pathinfo() works fine:

pathinfo("àyà/àxà", PATHINFO_DIRNAME); // returns "àyà"

The PHP documentation warns that basename() and pathinfo() are locale-aware, but this does not justify the inconsistency between pathinfo(..., PATHINFO_BASENAME) and pathinfo(..., PATHINFO_DIRNAME), let alone the fact that identical non-Latin characters are discarded or kept depending on their position relative to Latin characters.

It sounds like a PHP bug.

Since basename() checks are important for security, in particular to prevent directory traversal, is there a reliable basename filter that handles Unicode input properly?

Demis Palma ツ asked Jul 23 '17 18:07


1 Answer

I've found that changing the locale fixes everything.

Apache runs with the "C" locale by default, whereas CLI scripts run by default with a UTF-8 locale such as "en_US.UTF-8" (or, in my case, "it_IT.UTF-8"). Under a UTF-8 locale, the problem does not occur.
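To see which locale a given environment actually runs with, a minimal diagnostic sketch like the following may help. It relies only on setlocale() and the PHP_SAPI constant; the variable name $ctype is just for illustration. Run it both from the command line and through the web server, then compare the output.

```php
<?php
// Passing "0" asks setlocale() to report the current setting
// without changing it.
$ctype = setlocale(LC_CTYPE, "0");

// PHP_SAPI tells which interface is running the script
// (for example "cli" or "apache2handler").
echo "SAPI:     ", PHP_SAPI, PHP_EOL;
echo "LC_CTYPE: ", $ctype, PHP_EOL;
```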

Therefore, the workaround on Apache is to change the locale from "C" to "C.UTF-8" before calling these functions.

setlocale(LC_ALL,'C.UTF-8');
basename("àxà"); // now returns "àxà", which is correct
pathinfo("àyà/àxà", PATHINFO_BASENAME); // now returns "àxà", which is correct
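Note that "C.UTF-8" is not installed on every system, and setlocale() returns false when the requested locale is unavailable. The following is a hedged sketch of a more defensive variant: the helper name useUtf8Ctype() and the list of candidate locale names are assumptions, not part of the original workaround. It also changes only LC_CTYPE, on the assumption that this is the category that governs multibyte character handling here.

```php
<?php
/**
 * Hypothetical helper: try several UTF-8 locale names for LC_CTYPE.
 * Returns the name that was applied, or false if none is installed.
 */
function useUtf8Ctype(array $candidates = ['C.UTF-8', 'en_US.UTF-8', 'en_US.utf8'])
{
    // When given an array, setlocale() tries each entry in order
    // and stops at the first one the system accepts.
    return setlocale(LC_CTYPE, $candidates);
}

$applied = useUtf8Ctype();

if ($applied === false) {
    // No candidate is installed: basename() results for non-ASCII
    // names should not be trusted in this environment.
    echo "No UTF-8 locale available", PHP_EOL;
} else {
    echo basename("àyà/àxà"), PHP_EOL;
}
```

The candidate names that actually exist on a server can be listed with `locale -a`.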

Or, even better, if you want to back up the current locale and restore it once done:

$lc = new LocaleManager();
$lc->doBackup();
$lc->fixLocale();
basename("àxà/àyà");
$lc->doRestore();


class LocaleManager
{
    /** @var array */
    private $backup;


    public function doBackup()
    {
        $this->backup = array();
        // Passing "0" returns the current locale without changing it
        $localeSettings = setlocale(LC_ALL, "0");
        if (strpos($localeSettings, ";") === false)
        {
            // All categories share the same locale
            $this->backup["LC_ALL"] = $localeSettings;
        }
        else
        {
            // If any of the locales differs, then setlocale() returns all the locales separated by semicolon
            // Eg: LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=C;...
            $locales = explode(";", $localeSettings);
            foreach ($locales as $locale)
            {
                // Limit to 2 parts so a value containing "=" stays intact
                list ($key, $value) = explode("=", $locale, 2);
                $this->backup[$key] = $value;
            }
        }
    }


    public function doRestore()
    {
        foreach ($this->backup as $key => $value)
        {
            setlocale(constant($key), $value);
        }
    }


    public function fixLocale()
    {
        setlocale(LC_ALL, "C.UTF-8");
    }
}
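If the code between the backup and the restore can throw, the restore step may be skipped. Here is a minimal sketch of a scoped alternative using try/finally. The function name withUtf8Ctype() and the candidate locale names are hypothetical, and it touches only LC_CTYPE, assuming that category is the one that matters here.

```php
<?php
/**
 * Hypothetical helper (not part of the class above): run $fn with a
 * UTF-8 LC_CTYPE, then restore the previous value even if $fn throws.
 */
function withUtf8Ctype(callable $fn)
{
    // "0" reads the current setting without changing it.
    $previous = setlocale(LC_CTYPE, "0");

    // Assumed candidate names; the first one installed is applied.
    setlocale(LC_CTYPE, ['C.UTF-8', 'en_US.UTF-8']);

    try {
        return $fn();
    } finally {
        // Runs on both normal return and exception.
        if ($previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }
}

$name = withUtf8Ctype(function () {
    return basename("àyà/àxà");
});
```

This keeps the locale change limited to the callback, so the rest of the request is unaffected.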
Demis Palma ツ answered Oct 10 '22 18:10