
Creating filenames with unicode characters

I am looking for some guidelines for how to create filenames with Unicode characters. Consider:

use open qw( :std :utf8 );
use strict;
use utf8;
use warnings;

use Data::Dump;
use Encode qw(encode);

my $utf8_file_name1 = encode('UTF-8', 'æ1', Encode::FB_CROAK | Encode::LEAVE_SRC);
my $utf8_file_name2 = 'æ2';
dd $utf8_file_name1;
dd $utf8_file_name2;
qx{touch $utf8_file_name1};
qx{touch $utf8_file_name2};
print (qx{ls æ*});

The output is:

"\xC3\xA61"
"\xE62"
æ1
æ2

Why doesn't it matter whether I encode the filename as UTF-8 or not? (The filename ends up as valid UTF-8 either way.)

Håkon Hægland asked Jul 12 '15 18:07



1 Answer

Because of a bug known as "The Unicode Bug". When the string is passed to the system as a file name, the equivalent of the following happens:

use Encode qw( encode_utf8 is_utf8 );

my $bytes = is_utf8($str) ? encode_utf8($str) : $str;

is_utf8 checks which of two string storage formats the scalar uses. This is an internal implementation detail you should never have to worry about, except when The Unicode Bug is involved.

Your program works because encode always returns a string for which is_utf8 returns false, and, under use utf8;, a string literal that contains non-ASCII characters always produces a string for which is_utf8 returns true.
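The two storage formats can be observed directly. The following is a minimal sketch that assumes only core Perl plus Encode; the variable names are made up for illustration. It builds the same two-character string twice, forces one copy into the other internal format with utf8::upgrade, and records what utf8::is_utf8 reports for each copy and for the result of encode.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Same characters (U+00E6 followed by "2"), two internal storage formats.
my $downgraded = "\x{E6}2";    # escape below 0x100: stored one byte per character
my $upgraded   = "\x{E6}2";
utf8::upgrade($upgraded);      # same characters, now stored internally as UTF-8

# The strings compare equal, yet the internal flag differs.
my $same_chars = $downgraded eq $upgraded     ? 1 : 0;
my $flag_down  = utf8::is_utf8($downgraded)   ? 1 : 0;
my $flag_up    = utf8::is_utf8($upgraded)     ? 1 : 0;

# encode returns bytes, and the flag is off on its result.
my $encoded      = encode('UTF-8', $upgraded);
my $flag_encoded = utf8::is_utf8($encoded)    ? 1 : 0;

print "same characters: $same_chars\n";                 # same characters: 1
print "flags: $flag_down $flag_up $flag_encoded\n";     # flags: 0 1 0
print "encoded length: ", length($encoded), "\n";       # encoded length: 3
```

The two copies are equal as far as the language is concerned, which is why relying on the flag to decide what reaches the file system is fragile.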

If you don't encode as you should, you will sometimes get the wrong result. For example, if you had used "\x{E6}2" instead of 'æ2', you would have gotten a different file name even though the strings have the same length and the same characters.

$ dir
total 0

$ perl -wE'
   use utf8;
   $fu="æ";
   $fd="\x{E6}";
   say sprintf "%vX", $_ for $fu, $fd;
   say $fu eq $fd ? "eq" : "ne";
   system("touch", $_) for "u".$fu, "d".$fd
'
E6
E6
eq

$ dir
total 0
-rw------- 1 ikegami ikegami 0 Jul 12 12:18 uæ
-rw------- 1 ikegami ikegami 0 Jul 12 12:18 d?
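One way to avoid depending on the internal format is to encode every file name explicitly before it leaves the program. The sketch below assumes the file system expects UTF-8 names; file_name_bytes is a hypothetical helper, not part of Encode or of the original answer. It shows that both storage formats of the same string yield identical bytes once encoded.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: turn a string of characters into the UTF-8 bytes
# to hand to the operating system. Croaks on characters that cannot be encoded.
sub file_name_bytes {
    my ($name) = @_;
    return encode('UTF-8', $name, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

my $downgraded = "\x{E6}2";    # stored one byte per character
my $upgraded   = "\x{E6}2";
utf8::upgrade($upgraded);      # same characters, stored internally as UTF-8

my $bytes_a = file_name_bytes($downgraded);
my $bytes_b = file_name_bytes($upgraded);

printf "%vX\n", $bytes_a;                        # C3.A6.32
printf "%vX\n", $bytes_b;                        # C3.A6.32
print $bytes_a eq $bytes_b ? "eq\n" : "ne\n";    # eq
```

The encoded result would then be what gets passed on, for example system("touch", file_name_bytes($name)), so the created name no longer depends on how the string happened to be built.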
ikegami answered Sep 29 '22 21:09