I have two Json files which come from different OSes.
Both files are encoded in UTF-8
and contain UTF-8
encoded filenames
.
One file comes from OS X and the filename is in NFD form: (od -bc
)
0000160 166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145
v e t l a ́ ** / H o u s e m e
the second contains the same filename but in NFC form:
000760 166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163
v e t l á ** / H o u s e m e s
As I have learned, this is called 'different normalization', and there is an CPAN module Unicode::Normalize
for handling it.
I'm reading both files with the next:
my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ;
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;
The read_file is from File::Slurp
and decode_json from the JSON::XS
.
Reading the JSON into perl structure, from one json file the filename comes into key
position and from the second file comes into the values
. I need to search when the hash key
from the 1st hash is equvalent to a value
from the second hash, so need ensure than they are "binary" identical.
Tried the next:
grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
and
grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
produces for me the same output.
Now the questions:
$hashrefs
?or need after the decode_json
run someting like on both hashes?
while(my($k,$v) = each(%$json1)) {
$copy->{ NFD($k) } = NFD($v);
}
In short:
$href
? It is possible to achieve somewhat nicer as explicitly doing NFD
on each key
value
and creating another NFD normalized (big) copy of the hashes?Some hints, suggestions - pleae...
Because my english is very bad, here is a simulation of the problem
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use Data::Dumper;
use JSON::XS;
#Creating two files what contains different "normalizations"
my($nfc, $nfd);;
$nfc->{ NFC('key') } = NFC('vál');
$nfd->{ NFD('vál') } = 'something';
#save as NFC - this comes from "FreeBSD"
my $jnfc = JSON::XS->new->encode($nfc);
open my $fd, ">:utf8", "nfc.json" or die("nfc");
print $fd $jnfc;
close $fd;
#save as NFD - this comes from "OS X"
my $jnfd = JSON::XS->new->encode($nfd);
open $fd, ">:utf8", "nfd.json" or die("nfd");
print $fd $jnfd;
close $fd;
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
say $jd->{ $jc->{key} } // "NO FOUND"; #wanted to print "something"
my $jc2;
#is here a better way to DO THIS?
while(my($k,$v) = each(%$jc)) {
$jc2->{ NFD($k) } = NFD($v);
}
say $jd->{ $jc2->{key} } // "NO FOUND"; #OK
While searching the right solution for your question i discovered: the software is c*rp :) See: https://stackoverflow.com/a/17448888/632407 .
Anyway, found the solution for your particular question - how to read json with filenames regardless of normalization:
instead of your:
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
use the next:
#now read them
my $jc = get_json_from_utf8_file('nfc.json') ;
my $jd = get_json_from_utf8_file('nfd.json') ;
...
sub get_json_from_utf8_file {
my $file = shift;
return
decode_json #let parse the json to perl
encode 'utf8', #the decode_json want utf8 encoded binary string, encode it
NFC #conv. to precomposed normalization - regardless of the source
read_file #your file contains utf8 encoded text, so read it correctly
$file, { binmode => ':utf8' } ;
}
This should (at least i hope) ensure than regardles what decomposition uses the JSON content, the NFC
will convert it to precomposed version and the JSON:XS will read parse it correctly to the same internal perl structure.
So your example prints:
something
without traversing the $json
The idea comes from Joseph Myers and Nemo ;)
Maybe some more skilled programmers will give more hints.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With