
In what encoding does readdir return a filename?

Here's a Perl script that I expected to print found when executed:

#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use Encode;

use constant filename => 'Bärlauch';

open (my $out, '>', filename) or die;
close $out;

opendir(my $dir, '.') or die;
while (my $filename_read = readdir($dir)) {
# $filename_read = encode('utf8', $filename_read);
  print "found\n" if $filename_read eq filename;
}

The script first creates a file with the name of the constant filename. (After running the script, I can verify the existence of the file with ls and the file is not created with "funny" characters.)

Then the script iterates over the files in the current working directory and prints found if there is a file whose name is equal to that of the file just created. This should obviously be the case.

However, it doesn't print anything (Ubuntu, bash, LANG=en_US.UTF8).

If I change the constant to Barlauch, it works as expected and prints found.

Uncommenting $filename_read = encode('utf8', $filename_read); does not change the behavior.

Is there an explanation for this, and what do I have to do in order to recognize a filename with umlauts in it?

René Nyffenegger asked May 04 '16 11:05



1 Answer

The question rephrased (as I interpret it) is:

Why doesn't readdir appear to return the newly created filename? (Here, the name is represented by the constant filename, which is set to Bärlauch.)

(Note: filename is a Perl constant, which is why it has no $ sigil in front.)

Background:

First note: due to the use utf8 statement at the beginning of your program, filename will be upgraded to a Unicode string at compile time, since it contains non-ASCII characters. From the documentation of the utf8 pragma:

Enabling the utf8 pragma has the following effect: Bytes in the source text that are not in the ASCII character set will be treated as being part of a literal UTF-8 sequence. This includes most literals such as identifier names, string constants, and constant regular expression patterns.

and also, according to perluniintro section "Perl's Unicode Model" :

The general principle is that Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be avoided, the data is transparently upgraded to Unicode.

...

Internally, Perl currently uses whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings.

The non-ASCII character in filename is the letter ä. If you use ISO 8859-1 extended ASCII encoding (Latin-1), it is encoded as the byte value 0xE4, see this table at ascii-code.com. However, if you removed the ä character from filename, it would contain only ASCII characters, and therefore it would not be internally upgraded to Unicode, even if you used the utf8 pragma.

So filename is now a Unicode string with the internal UTF-8 flag set (see the utf8 pragma for more information on the UTF-8 flag). Note that the letter ä is encoded in UTF-8 as the two bytes 0xC3 0xA4.
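The difference between the character string and its UTF-8 bytes can be inspected directly. The following is a minimal sketch, assuming the script itself is saved as UTF-8; the variable names are illustrative and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # source literals are decoded from UTF-8
use Encode qw(encode);

my $name = 'Bärlauch';

# A character string: 8 characters, internal UTF-8 flag set.
my $chars   = length $name;
my $flagged = utf8::is_utf8($name) ? 1 : 0;

# The same name as UTF-8 encoded bytes: ä becomes 0xC3 0xA4, so 9 bytes.
my $bytes = encode('UTF-8', $name);
my $hex   = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

print "characters: $chars, flag: $flagged\n";
print "bytes: ", length($bytes), " ($hex)\n";
```

The character string reports 8 characters, while the encoded form has 9 bytes because ä takes two bytes in UTF-8.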

Writing the file:

When writing the file, what happens to the filename? If filename is a Unicode string, it will be passed to the operating system as UTF-8 encoded bytes, so it is not necessary to encode it first (encode_utf8( filename )). See Creating filenames with unicode characters for more information. So the filename is written to disk as UTF-8 encoded bytes.

Reading the filename back:

When trying to read the filename back from disk, readdir does not return Unicode strings (strings with the UTF-8 flag set) even if the filename contains bytes encoded in UTF-8. It returns binary or byte strings, see perlunitut for a discussion of byte strings vs character (Unicode) strings.
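This can be checked with a small experiment. The sketch below assumes a Linux-like filesystem that stores names as raw bytes, and it uses a temporary directory from File::Temp (not part of the original script) so that only one entry is listed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempdir);

# Work in a throwaway directory so only our file is listed.
my $dir  = tempdir(CLEANUP => 1);
my $name = 'Bärlauch';

open(my $out, '>', "$dir/$name") or die "open: $!";
close $out;

opendir(my $dh, $dir) or die "opendir: $!";
my ($read) = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

# readdir hands back the raw bytes of the name, without the UTF-8 flag.
my $read_flagged = utf8::is_utf8($read) ? 1 : 0;
printf "flag: %d, length: %d (literal has %d characters)\n",
    $read_flagged, length($read), length($name);
```

On such a system the name read back should have no UTF-8 flag and be one unit longer than the literal, since each byte counts as one character.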

Why doesn't readdir return Unicode strings? First, according to perlunicode section "When Unicode Does Not Happen" :

There are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both in Perl, but it is not. (...)

The following are such interfaces. For all of these interfaces Perl currently (as of v5.16.0) simply assumes byte strings both as arguments and results. (...)

One reason that Perl does not attempt to resolve the role of Unicode in these situations is that the answers are highly dependent on the operating system and the file system(s). For example, whether filenames can be in Unicode and in exactly what kind of encoding, is not exactly a portable concept. (...)

  • chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename, rmdir, stat, symlink, truncate, unlink, utime, -X
  • %ENV
  • glob (aka the <*>)
  • open, opendir, sysopen
  • qx (aka the backtick operator), system
  • readdir, readlink

So readdir returns byte strings, since it is in general impossible to know the encoding of a file name a priori. For background information about why this is impossible, see for example:

  • filename in Wikipedia, sub section "Encoding interoperability",
  • Understanding Unix file name encoding on unix.stackexchange.com

String comparison:

Now, finally, you try to compare the filename read from disk, $filename_read, with the constant filename:

print "found\n" if $filename_read eq filename;

In this case the only difference between $filename_read and filename is that $filename_read does not have the UTF-8 flag set (it is not what Perl internally recognizes as a "Unicode string").

The interesting thing now is that the result of the eq operator will depend upon whether the bytes in $filename_read are pure ASCII or not. According to the documentation of the Encode module:

Before the introduction of Unicode support in Perl, The eq operator just compared the strings represented by two scalars. Beginning with Perl 5.8, eq compares two strings with simultaneous consideration of the UTF8 flag.

...

When you decode, the resulting UTF8 flag is on--unless you can unambiguously represent data.

So in your case, eq will consider the UTF-8 flag since $filename_read does not contain pure ASCII, and as a result it will consider the two strings not equal: the UTF-8 encoded ä in $filename_read is seen as the two characters 0xC3 and 0xA4, whereas filename contains the single character U+00E4. If $filename_read and filename were identical and contained only pure ASCII bytes (with filename still having the UTF-8 flag set and $filename_read not), then eq would consider the two strings equal. See the discussion in the documentation for Encode for more information regarding the background for this behavior.
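The effect can be reproduced without touching the filesystem. In this sketch the byte string is produced with encode to stand in for what readdir returns; utf8::upgrade is used only to force the flag onto an ASCII string for the demonstration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $chars = 'Bärlauch';                 # character string, 8 characters
my $bytes = encode('UTF-8', $chars);    # byte string, 9 bytes

# eq compares character by character: the single character U+00E4
# never matches the two bytes 0xC3 0xA4, so the strings differ.
my $non_ascii_equal = ($bytes eq $chars) ? 1 : 0;

# With pure ASCII content, bytes and characters coincide, so the
# comparison succeeds even though only one side carries the flag.
my $ascii_chars = 'Barlauch';
utf8::upgrade($ascii_chars);            # force the flag on for the demo
my $ascii_bytes = encode('UTF-8', $ascii_chars);
my $ascii_equal = ($ascii_bytes eq $ascii_chars) ? 1 : 0;

print "non-ASCII equal: $non_ascii_equal\n";
print "ASCII equal:     $ascii_equal\n";
```

This matches the observation in the question: Barlauch is found, Bärlauch is not.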

Conclusion:

So if you are relatively confident that all your filenames are UTF-8 encoded, you can solve the issue in your question by decoding the byte string returned from readdir into a Unicode string (forcing the UTF-8 flag to be set):

$filename_read = Encode::decode_utf8( $filename_read );
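Put together, a corrected version of the script from the question might look like the sketch below. It assumes UTF-8 encoded filenames on a Linux-like filesystem and, unlike the original, writes into a temporary directory (the $dir_name variable and the $found flag are additions for illustration).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(decode_utf8);
use File::Temp qw(tempdir);

use constant filename => 'Bärlauch';

# Hypothetical scratch directory, used here instead of '.'.
my $dir_name = tempdir(CLEANUP => 1);

open(my $out, '>', $dir_name . '/' . filename) or die "open: $!";
close $out;

my $found = 0;
opendir(my $dir, $dir_name) or die "opendir: $!";
# Explicit defined() check; readdir returns undef at the end of the directory.
while (defined(my $filename_read = readdir($dir))) {
    # Turn the raw bytes into a character string before comparing.
    $filename_read = decode_utf8($filename_read);
    $found = 1 if $filename_read eq filename;
}
closedir $dir;

print "found\n" if $found;
```

Decoding happens once, right after readdir, so every later comparison works on character strings.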

More details:

Note: since Unicode allows multiple representations of the same character, there exist two forms of the ä (LATIN SMALL LETTER A WITH DIAERESIS) in Bärlauch:

  • U+00E4 is the NFC (Normalization Form canonical Composition) form,
  • U+0061 U+0308 (the letter a followed by COMBINING DIAERESIS) is the NFD (Normalization Form canonical Decomposition) form.

On my platform (Linux), UTF-8 encoded filenames are typically created in NFC form, whereas Mac OS stores them in NFD form. See Encode::UTF8Mac for more information. This means that if you work on a Linux machine and, for example, clone a Git repository that was created by a Mac user, you can easily end up with NFD encoded filenames on your Linux machine.

The Linux filesystem does not care what encoding a filename is in; it just treats it as a sequence of bytes. Hence, I could easily write a script that creates an ISO-Latin-1 encoded filename, even though my locale is "en_US.UTF-8". The current locale settings are just guidelines for applications; if an application ignores them, nothing stops it from doing so.
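As an illustration of that last point, the following sketch creates a Latin-1 encoded filename regardless of the locale. It assumes a filesystem that accepts arbitrary bytes in names (as common Linux filesystems do) and uses a temporary directory for the experiment.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Encode the name as ISO-8859-1: ä becomes the single byte 0xE4.
my $latin1_name = encode('ISO-8859-1', "B\x{E4}rlauch");

open(my $out, '>', "$dir/$latin1_name") or die "open: $!";
close $out;

opendir(my $dh, $dir) or die "opendir: $!";
my ($read) = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

# Show the bytes that readdir returns for the name.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $read;
print "bytes on disk: $hex\n";
```

A name like this is not valid UTF-8, so ls in a UTF-8 terminal would typically show a placeholder for the 0xE4 byte, and decode_utf8 would not recover an ä from it.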

So if you are unsure whether filenames returned from readdir use NFC or NFD, you should always normalize both sides to the same form (for example, decompose them) after you have decoded them:

use Unicode::Normalize;
print "found\n" if NFD( $filename_read ) eq NFD( filename );
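To see why normalization matters, here is a small self-contained sketch that builds both forms with explicit escapes (so it does not depend on how the source file is encoded) and compares them with and without Unicode::Normalize. The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "B\x{E4}rlauch";     # ä as the single code point U+00E4
my $decomposed = "Ba\x{308}rlauch";   # a followed by U+0308 COMBINING DIAERESIS

# Without normalization the two spellings are different strings.
my $raw_equal = ($composed eq $decomposed) ? 1 : 0;

# After normalizing both sides to the same form they compare equal.
my $nfd_equal = (NFD($composed) eq NFD($decomposed)) ? 1 : 0;
my $nfc_equal = (NFC($composed) eq NFC($decomposed)) ? 1 : 0;

printf "raw: %d, NFD: %d, NFC: %d\n", $raw_equal, $nfd_equal, $nfc_equal;
printf "lengths: %d composed, %d decomposed\n",
    length($composed), length($decomposed);
```

Either form works for the comparison, as long as both operands are normalized the same way.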

See also Perl Unicode Cookbook section "Always Decompose and Recompose".

Finally, to understand more about how the locale works together with Unicode in Perl, you could have a look at:

  • perllocale, section "Unicode and UTF-8", and
  • Encode::Locale.
Håkon Hægland answered Sep 28 '22 19:09
