I am writing a Perl program to convert my local language ASCII characters to Unicode characters (Tamil).
This is my program:
#!/bin/perl
use strict;
use warnings;
use open ':std';
use open ':encoding(UTF-8)';
use Encode qw( encode decode );
use Data::Dump qw(dump);
use Getopt::Long qw(GetOptions);
Getopt::Long::Configure qw(gnu_getopt);
my $font;
my %map;
GetOptions(
'font|f=s' => \$font,
'help|h' => \&usage,
) or die "Try $0 -h for help";
print "Do you want to map $font? (y/n)";
chomp( my $answer = lc <STDIN> );
$font = lc( $font );
$font =~ s/ /_/;
$font =~ s/(.*?)\.ttf/$1/;
if ( $answer eq "y" ) {
map_font();
}
else {
restore_map();
}
foreach ( @ARGV ) {
my $modfile = "$_";
$modfile =~ s/.*\/(.*)/uni$1/;
process_file( $_, $modfile );
}
sub process_file {
my @options = @_;
open my $source, '<', "$options[0]";
my $result = $options[1];
my $test = "./text";
my $missingchar = join( "|", map( quotemeta, sort { length $b <=> length $a } keys %map ) );
while ( <$source> ) {
$/ = undef;
s/h;/u;/g; #Might need change based on the tamil font
s/N(.)/$1N/g; #Might need change based on the tamil font
s/n(.)/$1n/g; #Might need change based on the font
s/($missingchar)/$map{$1}/g;
print "$_";
open my $final, '>:utf8', "$result";
print $final "$_";
close $final;
}
}
sub map_font {
my @oddhexes = qw/0B95 0B99 0B9A 0B9E 0B9F 0BA3 0BA4 0BA8 0BAA 0BAE 0BAF 0BB0 0BB2 0BB5 0BB3 0BB4 0BB1 0BA9/;
my @missingletters = qw/0BC1 0BC2/;
my @rest = qw/0B85 0B86 0B87 0B88 0B89 0B8A 0B8E 0B8F 0B90 0B92 0B93 0B83 0BBE 0BBF 0BC0 0BC6 0BC7 0BC8 0BCD 0B9C 0BB7 0BB8 0BB9 0BCB 0BCA 0BCC/;
foreach ( @oddhexes ) {
my $oddhex = $_;
$_ = encode( 'utf8', chr( hex( $_ ) ) );
print "Press the key for $_ :";
chomp( my $bole = <STDIN> );
if ( $bole eq "" ) {
next;
}
$map{$bole} = $_;
foreach ( @missingletters ) {
my $oddchar = encode( 'utf8', chr( hex( $oddhex ) ) . chr( hex( $_ ) ) );
print "Press the key for $oddchar :";
chomp( my $missingchar = <STDIN> );
if ( $missingchar eq "" ) {
next
}
$map{$missingchar} = $oddchar;
}
}
foreach ( @rest ) {
$_ = encode( 'utf8', chr( hex( $_ ) ) );
print "Press the key for $_ :";
chomp( my $misc = <STDIN> );
if ( $misc eq "" ) {
next
}
$map{$misc} = $_;
}
open my $OUTPUT, '>', $font || die "can't open file";
print $OUTPUT dump( \%map );
close $OUTPUT;
}
sub restore_map {
open my $in, '<', "$font" || die "can't open file: $!";
{
local $/;
%map = %{ eval <$in> };
}
close $in;
}
sub usage {
print "\nUsage: $0 [options] {file1.txt file2.txt..} \neg: $0 -f TamilBible.ttf chapter.txt\n\nOptions:\n -f --font - used to pass font name\n -h --help - Prints help\n\nManual mapping of font is essential for using this program\n";
exit;
}
In subroutine process_file, the output of print "$_"; displays proper Tamil Unicode characters in the terminal. However, the output to the file handle $final is very different.
The %map is here.
Why are the outputs different?
How can I correct this behaviour?
I have seen this question but this is not the same. In my case the terminal displays the result correctly while the filehandle output is different.
Your open statement
open my $final, '>:utf8', "$result";
sets your file handle to expect characters and to encode them into UTF-8 sequences on the way out. But you are sending it pre-encoded byte sequences from the %map hash, so those bytes are treated as characters and encoded a second time by Perl IO.
In contrast, your terminal is set to expect UTF-8-encoded data, but STDOUT isn't set to do any encoding at all (use open ':std' has no effect on its own, see below), so it passes your UTF-8-encoded bytes through unchanged, which happens to be what the terminal expects.
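The double encoding can be reproduced in isolation. The following is a minimal sketch, not taken from the original program: it writes one Tamil letter to two in-memory file handles (used here only to keep the example self-contained), once as a character and once as pre-encoded bytes like those stored in %map. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode );

# U+0B95 TAMIL LETTER KA as a single character
my $char = chr( hex('0B95') );

# The same letter pre-encoded into three UTF-8 bytes, as the values in %map are
my $bytes = encode( 'UTF-8', $char );

my ( $good, $bad ) = ( '', '' );

# Character in, encoding layer on the handle: encoded exactly once
open my $good_fh, '>:encoding(UTF-8)', \$good
    or die "Unable to open in-memory handle: $!";
print $good_fh $char;
close $good_fh;

# Bytes in, encoding layer on the handle: each byte is treated as a
# character in the range U+0080..U+00FF and encoded again
open my $bad_fh, '>:encoding(UTF-8)', \$bad
    or die "Unable to open in-memory handle: $!";
print $bad_fh $bytes;
close $bad_fh;

printf "character written:         %d bytes\n", length $good;    # 3
printf "pre-encoded bytes written: %d bytes\n", length $bad;     # 6
```

The second buffer is twice as long because each of the three original bytes has been expanded into a two-byte sequence, which is the corruption seen in the output file.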
By the way, you have set a default open mode of :encoding(UTF-8) for input and output streams with
use open ':encoding(UTF-8)'
but have overridden it in your call to open. The :utf8 mode does a very basic translation from wide characters to byte sequences, but :encoding(UTF-8) is far more useful because it checks that each character being printed is a valid Unicode value. There is a good chance that it would have caught a mistake like this, and it would have been better to allow the default and write just
open my $final, '>', $result;
To keep things clean and tidy, your program should work in characters, and the file handles should be set to encode those characters to UTF-8 when they are printed.
You can set UTF-8 as the default encoding for all newly-opened file handles, as well as STDIN and STDOUT, by adding
use open qw/ :std :encoding(utf-8) /;
to the top of your program (:encoding(utf-8) is preferable to :utf8) and removing all calls to encode. You had it almost right, but the :std and :encoding(utf-8) need to be in the same use statement.
You should also add
use utf8;
at the very top so that you can use UTF-8 characters in the program itself.
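As a rough sketch of what working in characters might look like, the example below builds a tiny mapping without any call to encode and lets the file handle do the encoding. The two-entry %map, the sample text, and the temporary output file are assumptions made for illustration; the real mapping would come from map_font or restore_map.

```perl
use utf8;
use strict;
use warnings;
use open qw/ :std :encoding(UTF-8) /;
use File::Temp qw( tempfile );

# Hypothetical mapping from legacy-font keystrokes to Tamil characters.
# The values are characters, not encoded byte strings.
my %map = (
    'f' => 'க',                    # literal allowed because of "use utf8"
    'k' => chr( hex('0BAE') ),     # TAMIL LETTER MA, built from its code point
);

# Longest keys first, as in the original process_file
my $pattern = join '|', map { quotemeta } sort { length $b <=> length $a } keys %map;

my $text = 'fk';                   # hypothetical input in the legacy encoding
$text =~ s/($pattern)/$map{$1}/g;

# Temporary file used only so that the sketch cleans up after itself
my ( $tmp_fh, $out_file ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# No explicit layer: the default from "use open" encodes on the way out
open my $out_fh, '>', $out_file
    or die qq{Unable to open "$out_file" for output: $!};
print $out_fh $text;
close $out_fh;

# STDOUT was set up by :std, so the terminal and the file now agree
print "$text\n";
```

With this arrangement the same character string goes to both STDOUT and the file, and each handle encodes it exactly once.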
You also have a few incidental errors. For instance
In the statement
open my $in, '<', "$font" || die "can't open file: $!";
it is almost always wrong to quote a single scalar variable like $font, unless it happens to be an object and you want to invoke its stringification method.
You need or instead of ||; otherwise you're just testing the truth of $font.
If I asked you what a variable called $in might contain, I think you'd be hesitant; $in_fh is better and is a common idiom.
It's always nice to put the name of the file into the die string as well as the reason from $!.
Taking all of those into account makes your statement look like this
open my $in_fh, '<', $font or die qq{Unable to open "$font" for input: $!};
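The precedence difference between || and or is easy to see with a file that does not exist. This is a small sketch; the path below is a made-up name chosen so that the open fails.

```perl
use strict;
use warnings;

# Hypothetical path that is assumed not to exist
my $missing = '/hypothetical/no/such/dir/map_file';

# || binds tighter than the list operator open, so this parses as
#   open( my $fh1, '<', ( $missing || die ... ) )
# $missing is a true string, die is never evaluated, and the failed
# open goes unreported.
my $ok_high = open my $fh1, '<', $missing || die "never reached";
print "|| form: open returned ", ( $ok_high ? 'true' : 'false' ), ", no die\n";

# or has very low precedence, so it tests the result of the whole open call
my $ok_low = eval {
    open my $fh2, '<', $missing
        or die qq{Unable to open "$missing" for input: $!\n};
    1;
};
print "or form: died with: $@" unless $ok_low;
```

Wrapping the second open in eval is only there so that the sketch can continue and report the message; in a real program the die would normally be left to terminate the script.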
You should be consistent between upper and lower case scalar variables, and lower case is the correct choice. So
open my $OUTPUT, '>', $font || die "can't open file";
should be something like
open my $out_fh, '>', $font or die qq{Unable to open "$font" for output: $!};
The line
$/ = undef;
should be local $/ as you have used elsewhere; otherwise you are permanently modifying the input record separator for the rest of your program and any modules it uses. It also appears after the first read from the file handle, so your program will read and process one line, and then the whole of the rest of the file in the next iteration of the while loop.
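One way to read the whole file in a single step, sketched here under the assumption that slurping is what was intended, is to localise $/ before the first read inside a small helper. The slurp subroutine and the in-memory sample are hypothetical and not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire file into one string
sub slurp {
    my ($path) = @_;
    open my $in_fh, '<', $path
        or die qq{Unable to open "$path" for input: $!};
    local $/;                  # undefined only until this sub returns
    my $content = <$in_fh>;    # a single read returns everything
    close $in_fh;
    return $content;
}

# An in-memory "file" stands in for a real path in this sketch
my $sample  = "line one\nline two\n";
my $content = slurp( \$sample );

print "read ", length $content, " characters in one go\n";
print "separator restored\n" if defined $/ and $/ eq "\n";
```

Because the change to $/ is scoped to the subroutine, any later line-by-line reads elsewhere in the program behave normally.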