I am working on an Android word game with a large dictionary.
The words (over 700,000) are kept as separate lines in a text file (and then put into an SQLite database).
To protect my dictionary, I'd like to hash all words longer than 3 characters with MD5. (I don't obfuscate short words or words with the rare Russian letters ъ and э, because I'd like to list them in my app.)
So here is my script which I try to run with perl v5.18.2 on Mac Yosemite:
#!/usr/bin/perl -w
use strict;
use utf8;
use Digest::MD5 qw(md5_hex);
binmode(STDIN, ":utf8");
#binmode(STDOUT, ":raw");
binmode(STDOUT, ":utf8");
while (<>) {
    chomp;
    next if length($_) < 2;    # ignore 1 letter junk
    next if /жы/;              # impossible combination in Russian
    next if /шы/;              # impossible combination in Russian
    s/ё/е/g;
    if (length($_) <= 3 || /ъ/ || /э/) {    # do not obfuscate short words
        print "$_\n";                       # and words with rare letters
        next;
    }
    print md5_hex($_) . "\n";  # this line crashes
}
As you can see, I have to use Cyrillic letters in the source code of my Perl script, which is why I've put use utf8; at its top.
However, my real problem is that length($_) reports values that are too high (probably the number of bytes instead of the number of characters).
So I have tried adding:
binmode(STDOUT, ":raw");
or:
binmode(STDOUT, ":utf8");
But the script then dies with "Wide character in subroutine entry" at the line with print md5_hex($_).
Please help me to fix my script.
I run it as:
perl ./generate-md5.pl < words.txt > encoded.txt
and here is example words.txt data for your convenience:
а
аб
абв
абвг
абвгд
съемка
UTF-8 is a variable-width character encoding based on 8-bit code units. Each Unicode code point is encoded as 1 to 4 bytes: the first 128 code points (0-127, the ASCII range) take a single byte each, so plain ASCII text is already valid UTF-8, while code points 128 and above take 2, 3, or 4 bytes. This keeps the encoding ASCII-compatible while still allowing international characters, such as Cyrillic or Chinese characters. UTF-16, by contrast, uses a minimum of two bytes per character. As of the mid-2020s, UTF-8 is one of the most popular encodings.
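To make the difference between bytes and characters concrete, here is a minimal Perl sketch. It uses one of the sample words from the question; the variable names are only illustrative. It compares length() on a decoded character string with length() on the same word encoded as UTF-8 bytes.

```perl
use strict;
use warnings;
use utf8;                          # the string literal below is UTF-8 source text
use Encode qw(encode_utf8);

my $word  = "съемка";              # decoded string: 6 Cyrillic characters
my $bytes = encode_utf8($word);    # the same word as a string of UTF-8 bytes

my $char_count = length($word);    # counts characters
my $byte_count = length($bytes);   # counts bytes (2 per Cyrillic letter here)

binmode(STDOUT, ":encoding(UTF-8)");
print "$word: $char_count characters, $byte_count bytes\n";
```

If length() reports twice the expected value for Cyrillic words, the input was most likely read as raw bytes rather than decoded into characters.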
md5_hex expects a string of bytes as input, but you're passing a decoded string (a string of Unicode code points). Explicitly encode the string before hashing it.
use strict;
use utf8;
use Digest::MD5;
use Encode;
# ....
# $_ is assumed to be a decoded (character) string; encode it to UTF-8 bytes unconditionally
print Digest::MD5::md5_hex(Encode::encode_utf8($_)),"\n";
# Conversion only when required:
print Digest::MD5::md5_hex(utf8::is_utf8($_) ? Encode::encode_utf8($_) : $_),"\n";
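Putting the pieces together, the filtering rules from the question combined with the explicit encoding step might look like the sketch below. The helper name obfuscate_word and the in-memory sample list are hypothetical, chosen only for illustration; the ":encoding(UTF-8)" output layer is also an assumption. A real script would read the words from a file handle opened with a decoding layer instead.

```perl
use strict;
use warnings;
use utf8;                                  # Cyrillic literals in this source file
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: takes one decoded word and returns the line to print,
# or nothing (undef in scalar context) if the word should be skipped.
sub obfuscate_word {
    my ($word) = @_;
    return if length($word) < 2;           # ignore 1 letter junk
    return if $word =~ /жы|шы/;            # impossible combinations in Russian
    $word =~ s/ё/е/g;
    # Short words and words with rare letters stay readable.
    return $word if length($word) <= 3 || $word =~ /[ъэ]/;
    # md5_hex needs bytes, so encode the character string first.
    return md5_hex(encode_utf8($word));
}

binmode(STDOUT, ":encoding(UTF-8)");       # encode wide characters on output

# Sample words from the question, used here in place of reading words.txt.
for my $sample ("а", "аб", "абв", "абвг", "абвгд", "съемка") {
    my $out = obfuscate_word($sample);
    print "$out\n" if defined $out;
}
```

Keeping the logic in a function that only ever sees decoded strings means length() and the regexes work on characters, and the conversion to bytes happens in exactly one place, right before hashing.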
If you are using Perl 5.0 or later, this can be resolved by changing to_json to encode_json.