"Wide character in subroutine entry" - UTF-8 encoded Cyrillic words as sequence of bytes

I am working on an Android word game with a large dictionary.

The words (over 700 000) are kept as separate lines in a text file (and then put in an SQLite database).

To protect my dictionary, I'd like to hash all words longer than 3 characters with MD5. (I don't obfuscate short words or words containing the rare Russian letters ъ and э, because I'd like to list them in my app.)

So here is my script, which I am trying to run with perl v5.18.2 on OS X Yosemite:

#!/usr/bin/perl -w

use strict;
use utf8;
use Digest::MD5 qw(md5_hex);

binmode(STDIN, ":utf8");
#binmode(STDOUT, ":raw");
binmode(STDOUT, ":utf8");

while(<>) {
        chomp;
        next if length($_) < 2; # ignore 1 letter junk
        next if /жы/;           # impossible combination in Russian
        next if /шы/;           # impossible combination in Russian

        s/ё/е/g;
    
        if (length($_) <= 3 || /ъ/ || /э/) { # do not obfuscate short words
                print "$_\n";                # and words with rare letters
                next;
        }

        print md5_hex($_) . "\n";            # this line crashes
}

As you can see, I have to use Cyrillic letters in the source code of my Perl script. That is why I've put use utf8; at the top.

However, my real problem is that length($_) reports values that are too high (probably the number of bytes instead of the number of characters).

So I have tried adding:

binmode(STDOUT, ":raw");

or:

binmode(STDOUT, ":utf8");

But the script then dies with "Wide character in subroutine entry" at the line with print md5_hex($_).

Please help me to fix my script.

I run it as:

perl ./generate-md5.pl < words.txt > encoded.txt

and here is example words.txt data for your convenience:

а
аб
абв
абвг
абвгд
съемка

asked Aug 29 '15 12:08

Alexander Farber

2 Answers

md5_hex expects a string of bytes as input, but you're passing a decoded string (a string of Unicode code points). Explicitly encode the string before hashing it.

use strict;
use warnings;
use utf8;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);
# ....
# Option 1: $_ is assumed to be a decoded character string (not checked)
print md5_hex(encode_utf8($_)), "\n";
# Option 2: encode only when the string carries Perl's internal UTF8 flag
print md5_hex(utf8::is_utf8($_) ? encode_utf8($_) : $_), "\n";
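Applied to the script from the question, the fix might look like the sketch below. It assumes the input really is UTF-8 and keeps the decoding layer on STDIN, so length() counts characters; only the value handed to md5_hex is encoded back to bytes. The helper name obfuscate_word is hypothetical and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # the source file contains Cyrillic literals
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: takes one decoded word and returns the line to print,
# or nothing (undef) when the word should be skipped.
sub obfuscate_word {
    my ($word) = @_;
    return if length($word) < 2;        # ignore 1 letter junk
    return if $word =~ /жы|шы/;         # impossible combinations in Russian
    $word =~ s/ё/е/g;
    # short words and words with rare letters stay readable
    return $word if length($word) <= 3 || $word =~ /[ъэ]/;
    # md5_hex wants bytes, so encode the characters to UTF-8 first
    return md5_hex(encode_utf8($word));
}

# Decode input and encode output, so the script works with characters inside.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

while (my $line = <STDIN>) {
    chomp $line;
    my $out = obfuscate_word($line);
    print "$out\n" if defined $out;
}
```

It would be run the same way as in the question: perl ./generate-md5.pl < words.txt > encoded.txt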

answered Sep 19 '22 08:09

AnFi

If you are using Perl 5.0 or later, this can be resolved by changing to_json to encode_json.
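This applies when the string being hashed comes from a JSON serializer rather than from a file. A minimal sketch of the difference, assuming the core module JSON::PP stands in for whichever JSON module is actually in use: an encoder without the utf8 option returns a character string, while encode_json returns UTF-8 bytes. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use JSON::PP;                   # exports encode_json by default

# A Cyrillic word written with escapes, so the example needs no "use utf8".
my $data = { word => "\x{430}\x{431}\x{432}\x{433}" };

# Without the utf8 option the encoder returns a string of characters.
my $chars = JSON::PP->new->encode($data);

# encode_json returns the same JSON text as UTF-8 encoded bytes.
my $bytes = encode_json($data);

# Hashing the character string dies, because it holds code points above 255.
my $ok  = eval { md5_hex($chars); 1 };
my $err = $@;
print "characters: ", ($ok ? "hashed" : "failed: $err");

# Hashing the byte string works.
print "bytes: ", md5_hex($bytes), "\n";
```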

answered Sep 18 '22 08:09

Tony Aziz

Tags: unicode, utf-8, md5, perl, cyrillic
