Normalization on utf8 filenames stored in JSON with perl

Tags: json, utf-8, perl

I have two JSON files which come from different OSes.

Both files are encoded in UTF-8 and contain UTF-8 encoded filenames.

One file comes from OS X and its filename is in NFD form (od -bc output):

0000160   166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145
           v   e   t   l   a    ́  **   /   H   o   u   s   e       m   e

the second contains the same filename but in NFC form:

000760   166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163
           v   e   t   l   á  **   /   H   o   u   s   e       m   e   s

As I have learned, this is called 'different normalization', and there is a CPAN module, Unicode::Normalize, for handling it.
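To make the difference concrete, here is a minimal sketch (core modules only) that decodes the two byte sequences from the dumps above and compares them before and after normalization. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC NFD);

# The bytes for "vetlá" as shown in the two od dumps (octal escapes)
my $nfd_bytes = "vetla\314\201";   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
my $nfc_bytes = "vetl\303\241";    # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# Turn the UTF-8 bytes into Perl character strings
my $from_osx   = decode('UTF-8', $nfd_bytes);
my $from_other = decode('UTF-8', $nfc_bytes);

printf "raw equal: %s\n", ($from_osx eq $from_other)           ? 'yes' : 'no';   # no
printf "NFC equal: %s\n", (NFC($from_osx) eq NFC($from_other)) ? 'yes' : 'no';   # yes
printf "lengths:   %d vs %d\n", length($from_osx), length($from_other);          # 6 vs 5
```

The two strings only compare equal once both are brought to the same form; whether that form is NFC or NFD does not matter, as long as it is applied to both sides.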

I'm reading both files with the following:

my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ;
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;

read_file is from File::Slurp and decode_json is from JSON::XS.

After reading the JSON into Perl structures, the filename from one JSON file ends up as a hash key, while the filename from the second file ends up as a value. I need to find where a hash key from the first hash is equivalent to a value from the second hash, so I need to ensure that they are "binary" identical.

I tried the following:

 grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc

and

 grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc

Both produce the same output for me.

Now the questions:

  • How can I simply read both JSON files so that both $hashrefs end up with the same normalization?

Or do I need to run something like this on both hashes after decode_json?

while(my($k,$v) = each(%$json1)) {
    $copy->{ NFD($k) } = NFD($v);
}

In short:

  • How can I read different JSON files so that the data 'inside' the Perl $href has the same normalization? Is it possible to achieve this more nicely than by explicitly applying NFD to every key and value and creating another NFD-normalized (big) copy of the hashes?

Some hints, suggestions - please...

Because my English is very bad, here is a simulation of the problem:

use 5.014;
use warnings;

use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);

use File::Slurp;
use Data::Dumper;
use JSON::XS;

#Creating two files that contain different "normalizations"
my($nfc, $nfd);
$nfc->{ NFC('key') } = NFC('vál');
$nfd->{ NFD('vál') } = 'something';

#save as NFC - this comes from "FreeBSD"
my $jnfc =  JSON::XS->new->encode($nfc);
open my $fd, ">:utf8", "nfc.json" or die("nfc");
print $fd $jnfc;
close $fd;

#save as NFD - this comes from "OS X"
my $jnfd =  JSON::XS->new->encode($nfd);
open $fd, ">:utf8", "nfd.json" or die("nfd");
print $fd $jnfd;
close $fd;

#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;

say $jd->{ $jc->{key} } // "NO FOUND";    #wanted to print "something"

my $jc2;
#is there a better way to do this?
while(my($k,$v) = each(%$jc)) {
    $jc2->{ NFD($k) } = NFD($v);
}
say $jd->{ $jc2->{key} } // "NO FOUND";    #OK

kobame asked Jul 02 '13 19:07


1 Answer

While searching for the right solution to your question, I discovered that the software is c*rp :) See: https://stackoverflow.com/a/17448888/632407.

Anyway, I found a solution for your particular question, namely how to read JSON with filenames regardless of normalization:

Instead of your:

#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;

use the following:

#now read them
my $jc = get_json_from_utf8_file('nfc.json') ;
my $jd = get_json_from_utf8_file('nfd.json') ;
...

sub get_json_from_utf8_file {
    my $file = shift;
    return
      decode_json      #parse the JSON into a Perl structure
        encode 'utf8', #decode_json wants a UTF-8 encoded byte string, so encode it
          NFC          #convert to precomposed normalization, regardless of the source
            read_file  #your file contains UTF-8 encoded text, so read it as such
              $file, { binmode => ':utf8' } ;
}

This should (at least I hope) ensure that, regardless of which normalization the JSON content uses, NFC will convert it to the precomposed form and JSON::XS will parse it into the same internal Perl structure.

So your example prints:

something

without traversing the $json
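The same idea can be sketched with core modules only. This is a minimal sketch, not a drop-in replacement: the helper name json_from_utf8_bytes is hypothetical, JSON::PP stands in for JSON::XS (both provide new and decode), and the two JSON documents are in-memory stand-ins for nfc.json and nfd.json. It relies on the fact that a JSON object created with new, without enabling utf8, expects an already decoded character string, so the re-encoding step can be skipped.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);
use JSON::PP;    # core module; stands in for JSON::XS here

# Hypothetical helper: parse UTF-8 encoded JSON bytes after normalizing the text to NFC.
sub json_from_utf8_bytes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # bytes -> characters
    # Without ->utf8, decode() takes a character string, so no re-encoding is needed.
    return JSON::PP->new->decode(NFC($text));
}

# In-memory stand-ins for the two files (octal escapes are the raw UTF-8 bytes)
my $jc = json_from_utf8_bytes(qq({"key":"v\303\241l"}));           # value in NFC
my $jd = json_from_utf8_bytes(qq({"va\314\201l":"something"}));    # key in NFD

my $found = $jd->{ $jc->{key} } // 'NOT FOUND';
print "$found\n";    # prints "something"
```

For real files, the byte string would come from reading the file in raw mode, as in the question.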

The idea comes from Joseph Myers and Nemo ;)

Maybe some more skilled programmers will give more hints.
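One further hint, as a sketch only: normalizing the text before parsing cannot see characters that the JSON spells as \uXXXX escapes, and it does not help with data that is already decoded. In those cases the structure can still be normalized after parsing. The helper name nfc_deep below is hypothetical; it handles nested hashes and arrays, unlike the flat loop in the question.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: return a copy of a decoded structure with every
# hash key and plain string value converted to NFC.
sub nfc_deep {
    my ($data) = @_;
    if (ref $data eq 'HASH') {
        my %copy;
        $copy{ NFC($_) } = nfc_deep($data->{$_}) for keys %$data;
        return \%copy;
    }
    if (ref $data eq 'ARRAY') {
        return [ map { nfc_deep($_) } @$data ];
    }
    # Leave undef and other references (such as JSON booleans) untouched.
    return $data if !defined $data || ref $data;
    return NFC($data);    # note: NFC() returns a string, so numbers are stringified
}

# Decomposed (NFD) input: base letter followed by U+0301 COMBINING ACUTE ACCENT
my $in  = { "va\x{301}l" => [ "a\x{301}", { k => "e\x{301}" } ] };
my $out = nfc_deep($in);
```

This builds a full copy, so for large structures the parse-time approach above is cheaper when it applies.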

jm666 answered Nov 15 '22 04:11