Here's a Perl script that I expected to print found when executed:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use Encode;
use constant filename => 'Bärlauch';
open (my $out, '>', filename) or die;
close $out;
opendir(my $dir, '.') or die;
while (my $filename_read = readdir($dir)) {
# $filename_read = encode('utf8', $filename_read);
print "found\n" if $filename_read eq filename;
}
The script first creates a file whose name is the value of the constant filename. (After running the script, I can verify the existence of the file with ls, and the file is not created with "funny" characters.)
Then the script iterates over the files in the current working directory and prints found if there is a file whose name is equal to that of the file just created. This should obviously be the case.
However, it doesn't (Ubuntu, bash, LANG=en_US.UTF8).
If I change the constant to Barlauch, it works as expected and prints found.
Uncommenting $filename_read = encode('utf8', $filename_read); does not change the behavior.
Is there an explanation for this, and what do I have to do in order to recognize a filename with umlauts in it?
The question rephrased (as I interpret it) is:
Why doesn't readdir return the newly created filename? (Here represented by the constant filename, which is set to Bärlauch.)
(Note: filename is a Perl constant, which is why it has no $ sigil in front.)
Background:
First note: due to the use utf8 statement at the beginning of your program, filename will be upgraded to a Unicode string at compile time, since it contains non-ASCII characters. From the documentation of the utf8 pragma:
Enabling the utf8 pragma has the following effect: Bytes in the source text that are not in the ASCII character set will be treated as being part of a literal UTF-8 sequence. This includes most literals such as identifier names, string constants, and constant regular expression patterns.
And also, according to perluniintro, section "Perl's Unicode Model":
The general principle is that Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be avoided, the data is transparently upgraded to Unicode.
...
Internally, Perl currently uses whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings.
The non-ASCII character in filename is the letter ä. In the ISO 8859-1 (Latin-1) encoding, it is encoded as the single byte 0xE4; see this table at ascii-code.com.
However, if you removed the ä character from filename, it would contain only ASCII characters, and therefore it would not be internally upgraded to Unicode, even if you used the utf8 pragma.
So filename is now a Unicode string with the internal UTF-8 flag set (see the utf8 pragma for more information on the UTF-8 flag). Note that the letter ä is encoded in UTF-8 as the two bytes 0xC3 0xA4.
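As a rough illustration of the difference between the character string and its UTF-8 encoded bytes, here is a minimal sketch. It assumes the script itself is saved as UTF-8; the variable names $name, $bytes and $hex are only for illustration and do not appear in the question.

```perl
use strict;
use warnings;
use utf8;                               # the string literal below is UTF-8 in the source
use Encode qw(encode);

my $name  = 'Bärlauch';                 # character string: 8 characters
my $bytes = encode('UTF-8', $name);     # byte string: 9 bytes

printf "characters: %d, UTF-8 flag: %s\n",
    length($name),  utf8::is_utf8($name)  ? 'on' : 'off';
printf "bytes:      %d, UTF-8 flag: %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';

# Hex dump of the encoded form; the ä shows up as the two bytes c3 a4.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;
print "$hex\n";
```

Only numbers and hex digits are printed here, so no output encoding layer is needed on STDOUT.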
Writing the file:
When writing the file, what happens with the filename? If filename is a Unicode string, it will be encoded as UTF-8. Note that it is not necessary to encode filename yourself first (encode_utf8( filename )). See Creating filenames with unicode characters for more information. So the filename is written to disk as UTF-8 encoded bytes.
Reading the filename back:
When trying to read the filename back from disk, readdir does not return Unicode strings (strings with the UTF-8 flag set), even if the filename contains bytes encoded in UTF-8. It returns binary or byte strings; see perlunitut for a discussion of byte strings vs. character (Unicode) strings.
Why doesn't readdir return Unicode strings? First, according to perlunicode section "When Unicode Does Not Happen":
There are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both in Perl, but it is not. (...)
The following are such interfaces. For all of these interfaces Perl currently (as of v5.16.0) simply assumes byte strings both as arguments and results. (...)
One reason that Perl does not attempt to resolve the role of Unicode in these situations is that the answers are highly dependent on the operating system and the file system(s). For example, whether filenames can be in Unicode and in exactly what kind of encoding, is not exactly a portable concept. (...)
- chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename, rmdir, stat, symlink, truncate, unlink, utime, -X
- %ENV
- glob (aka the <*>)
- open, opendir, sysopen
- qx (aka the backtick operator), system
- readdir, readlink
So readdir returns byte strings, since it is in general impossible to know the encoding of a file name a priori. For background information about why this is impossible, see for example:
String comparison:
Now, finally, you try to compare the read filename $filename_read with the constant filename:
print "found\n" if $filename_read eq filename;
In this case the only difference between $filename_read and filename is that $filename_read does not have the UTF-8 flag set (it is not what Perl internally recognizes as a "Unicode string").
The interesting thing now is that the result of the eq operator will depend upon whether the bytes in $filename_read are pure ASCII or not. According to the documentation of the Encode module:
Before the introduction of Unicode support in Perl, The eq operator just compared the strings represented by two scalars. Beginning with Perl 5.8, eq compares two strings with simultaneous consideration of the UTF8 flag. ...
When you decode, the resulting UTF8 flag is on--unless you can unambiguously represent data.
So in your case, eq will take the UTF-8 flag into account, since $filename_read does not contain pure ASCII, and as a result it will consider the two strings not equal: the byte string is read as the two characters 0xC3 0xA4 where filename has the single character ä (U+00E4). If $filename_read and filename were identical and contained only pure ASCII bytes (with filename still having the UTF-8 flag set and $filename_read not), then eq would consider the two strings equal. See the discussion in the documentation for Encode for more information regarding the background for this behavior.
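The comparison behavior described above can be reproduced without touching the filesystem. The following is a small sketch, assuming the script is saved as UTF-8; $bytes stands in for what readdir would return, and all variable names are hypothetical.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

my $chars = 'Bärlauch';                  # what the constant holds (character string)
my $bytes = encode('UTF-8', $chars);     # stand-in for the byte string from readdir

# Byte string vs. character string: not equal, because ä is two bytes on one
# side and one character on the other.
my $raw_equal     = ($bytes eq $chars) ? 1 : 0;

# After decoding, both sides are character strings and compare equal.
my $decoded_equal = (decode('UTF-8', $bytes) eq $chars) ? 1 : 0;

# Pure ASCII compares equal whether or not the UTF-8 flag is set.
my $ascii    = 'Barlauch';
my $upgraded = $ascii;
utf8::upgrade($upgraded);                # sets the UTF-8 flag, same characters
my $ascii_equal = ($ascii eq $upgraded) ? 1 : 0;

print "raw: $raw_equal, decoded: $decoded_equal, ascii: $ascii_equal\n";
```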
Conclusion:
So if you are relatively confident that all your filenames are UTF-8 encoded, you could solve the issue in your question by decoding the byte string returned from readdir into a Unicode string (forcing the UTF-8 flag to be set):
$filename_read = Encode::decode_utf8( $filename_read );
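Put together, a corrected version of the script might look like the sketch below. It is not the only way to do this. It assumes a filesystem that stores filenames as the bytes it is given (as on typical Linux setups) and that the script is saved as UTF-8. Unlike the original, it works in a temporary directory and encodes the name explicitly when creating the file; $target, $found and the other variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $target = 'Bärlauch';                        # character string
my $dir    = tempdir(CLEANUP => 1);             # scratch directory, removed at exit

# Create the file, passing explicitly encoded bytes to open.
my $path = "$dir/" . encode('UTF-8', $target);
open(my $out, '>', $path) or die "cannot create file: $!";
close $out;

my $found = 0;
opendir(my $dh, $dir) or die "cannot open directory: $!";
while (defined(my $entry = readdir $dh)) {
    # readdir returns bytes; decode them before comparing with $target.
    my $name = decode('UTF-8', $entry);
    $found = 1 if $name eq $target;
}
closedir $dh;

print $found ? "found\n" : "not found\n";
```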
More details
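Note: since Unicode allows multiple representations of the same character, there exist two forms of the ä in Bärlauch: the precomposed form, LATIN SMALL LETTER A WITH DIAERESIS (U+00E4), used in NFC, and the decomposed form, LATIN SMALL LETTER A (U+0061) followed by COMBINING DIAERESIS (U+0308), used in NFD.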
On my platform (Linux), UTF-8 encoded filenames are typically stored in NFC form, but on Mac OS they use NFD form. See Encode::UTF8Mac for more information. This means that if you work on a Linux machine and, for example, clone a Git repository that was created by a Mac user, you can easily get NFD-encoded filenames on your Linux machine. The Linux filesystem does not care what encoding a filename is in; it just treats it as a sequence of bytes. Hence, I could easily write a script that created an ISO-Latin-1 encoded filename, even though my locale is "en_US.UTF-8". The current locale settings are just guidelines for applications; if an application ignores the locale settings, there is nothing that stops it from doing so.
So if you are unsure whether filenames returned from readdir are using NFC or NFD, you should always decompose after you have decoded them:
use Unicode::Normalize;
print "found\n" if NFD( $filename_read ) eq NFD( filename );
See also Perl Unicode Cookbook section "Always Decompose and Recompose".
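To see why normalization matters, here is a short sketch that builds both forms of the name from explicit code points and compares them with and without NFD. The variable names are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $composed   = "B\x{E4}rlauch";       # ä as the single code point U+00E4 (NFC)
my $decomposed = "Ba\x{308}rlauch";     # a followed by U+0308 COMBINING DIAERESIS (NFD)

# The two strings look identical when displayed but differ in length and content.
my $plain_equal = ($composed eq $decomposed) ? 1 : 0;

# Normalizing both sides to the same form makes the comparison succeed.
my $nfd_equal = (NFD($composed) eq NFD($decomposed)) ? 1 : 0;

printf "lengths: %d vs %d, plain eq: %d, NFD eq: %d\n",
    length($composed), length($decomposed), $plain_equal, $nfd_equal;
```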
Finally, to understand more about how the Locale works together with Unicode in Perl, you could have a look at: