Outputting special characters correctly (Unicode) in Perl

I am trying to get all the file names in the directory and determine which names contain special characters. I am using the regex

/[^-a-zA-Z0-9_.]/

Sample files (created with touch):

pdf-2014à014&7_06_64-Os_O&L,_Inc.pdf
pdf-20_06_04-O_OnLine,_Inc.pdf
pdf-20_0_0-Utà_d.wr.pdf
pdf-20_12_28-20.Mga_Grf.Fwd_Notice_KDJFI789&_JFK38.pdf
pdf-2_0_0-C_—_DUKE.pdf
pdf-2_1_3-f_s-M_F_D&A.pdf
pdf_-_2014à014&1007_0617_06264-O_O&L,_Inc.pdf

Perl prints the name correctly before I match it against the regex. The regex does match the special character, but when I print the match, the character comes out changed.

* pdf-2_0_0-C_—_DUKE.pdf          >        pdf-2_0_0-C_???_DUKE.pdf

If I uncomment this line

   #binmode(STDOUT, ":utf8");

and run the script again, the ? marks are gone, but the output is still wrong:

* pdf-2_0_0-C_—_DUKE.pdf          >        pdf-2_0_0-C_â_DUKE.pdf

Here is my code:

use strict;
use warnings;
use File::Find;
use Cwd;

#binmode(STDOUT, ":utf8");

my $starting_directory = cwd();

use Term::ANSIColor;

checkForSpecialChar(cwd());


sub checkForSpecialChar{
    my ($source_dir) = @_;

    chdir $source_dir or die qq(Cannot change into "$source_dir");

    find ( sub {
        return unless -f;   #We want files only
        print "\n";
        while(m/([^-a-zA-Z0-9_.])/g){ 
            chomp($_);
            print "DETECTED: |" . $_ . "|\n";
            print $`;
            print color 'bold red';
            print "$1";
            print color 'reset';
            print  $' . "\n";

        }

    }, ".");

    chdir("$starting_directory");
}

Any ideas?

UPDATE: You are right, it looks like the regex is not the problem. AKHolland, I changed the code to look just like yours for testing, but I still get the same problem with the dash and the small letter a-grave. Instead of a small a-grave I get a` without binmode(STDOUT, ":utf8"); and aÌ with binmode(STDOUT, ":utf8");

use strict;
use warnings;
use File::Find;
use Cwd;
use Encode;
binmode(STDOUT, ":utf8");

my $starting_directory = cwd();

use Term::ANSIColor;

checkForSpecialChar(cwd());


sub checkForSpecialChar{
    my ($source_dir) = @_;

    chdir $source_dir
        or die qq(Cannot change into "$source_dir");

    find ( sub {
        return unless -f;   #We want files only
        print $_ . "\n";
        $_ = Encode::decode_utf8($_);
        for(my $counter =0; $counter < length($_); $counter++) {
            print Encode::encode_utf8(substr($_,$counter,1)) .  "\n";
        }
    }, ".");

    chdir("$starting_directory");
}

Output with

    binmode(STDOUT, ":utf8");

pdf-2_0_0-C_â_DUKE.pdf
p
d
f
-
2
_
0
_
0
-
C
_
â
_
D
U
K
E
.
p
d
f
pdf_-_2014aÌ014&1007_0617_06264-O_O&L,_Inc.pdf
p
d
f
_
-
_
2
0
1
4
a
Ì
0
1
4
&
1
0
0
7
_
0
6
1
7
_
0
6
2
6
4
-
O
_
O
&
L
,
_
I
n
c
.
p
d
f
Output without
    binmode(STDOUT, ":utf8");

pdf-2_0_0-C_—_DUKE.pdf
p
d
f
-
2
_
0
_
0
-
C
_
—
_
D
U
K
E
.
p
d
f
pdf_-_2014à014&1007_0617_06264-O_O&L,_Inc.pdf
p
d
f
_
-
_
2
0
1
4
a
̀
0
1
4
&
1
0
0
7
_
0
6
1
7
_
0
6
2
6
4
-
O
_
O
&
L
,
_
I
n
c
.
p
d
f
takuji asked Sep 29 '22 16:09


2 Answers

You need to decode it on the way in and encode it on the way out. Something like this:

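use strict;
use warnings;
use File::Find;
use Encode;

find ( sub {
    # File names arrive as raw bytes; decode them into characters first.
    $_ = Encode::decode_utf8($_);
    while(m/([^-a-zA-Z0-9_.])/g){
        # Encode each matched character back to UTF-8 bytes before printing.
        my $chr = Encode::encode_utf8($1);
        print "$chr\n";
    }
}, ".");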
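
A variation on the same idea, as a minimal sketch rather than a definitive fix: decode each name once on the way in, and let an output layer on STDOUT do the encoding, so nothing gets encoded twice. The helper name special_chars and the hard-coded sample name are illustrative and not from the original post. The sketch assumes the file system hands back names as UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Encode everything printed to STDOUT as UTF-8, exactly once.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: returns the characters outside [-a-zA-Z0-9_.]
# found in a file name that is given as raw UTF-8 bytes.
sub special_chars {
    my ($raw_name) = @_;
    my $name = decode('UTF-8', $raw_name);   # bytes -> characters
    my @matches = $name =~ /([^-a-zA-Z0-9_.])/g;
    return @matches;
}

# Sample name from the question, written as the bytes readdir would return
# (assumption: the name is stored as UTF-8).
my $raw   = "pdf-2_0_0-C_\xE2\x80\x94_DUKE.pdf";
my @found = special_chars($raw);
print "DETECTED: $_\n" for @found;
```

With this layout, printing a still-undecoded name through the same handle would double-encode it, which matches the â seen in the question. Decode first, then print.
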
AKHolland answered Oct 05 '22 08:10


The character in pdf-2_0_0-C_—_DUKE.pdf is encoded as 3 bytes in UTF-8:

char Unicode   UTF-8
—    U+2014    \xe2\x80\x94

So, as @AKHolland said, you have to encode it.
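
To make the byte counts concrete, here is a small sketch that dumps the UTF-8 bytes of the dash and also looks at the a-grave from the question. The helper utf8_hex is a name made up for this example. The decomposed form (a followed by a combining grave accent) is an assumption read off the per-character listing in the question, where a and the accent print on separate lines.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: hex dump of a character string's UTF-8 encoding.
sub utf8_hex {
    my ($chars) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, encode('UTF-8', $chars);
}

my $dash       = "\x{2014}";    # EM DASH, one character
my $decomposed = "a\x{0300}";   # 'a' + COMBINING GRAVE ACCENT, two characters

printf "dash:       %d char(s), bytes %s\n", length($dash), utf8_hex($dash);
printf "decomposed: %d char(s), bytes %s\n", length($decomposed), utf8_hex($decomposed);

# NFC composes the pair into the single character U+00E0 (a-grave).
my $composed = NFC($decomposed);
printf "composed:   %d char(s), bytes %s\n", length($composed), utf8_hex($composed);
```

The dash comes out as E2 80 94 and the decomposed a-grave as 61 CC 80. If those raw bytes are printed through a :utf8 layer without decoding first, each byte is treated as a Latin-1 character and encoded again: E2 shows up as â and CC as Ì, which is the garbling reported in the question.
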

Toto answered Oct 05 '22 06:10