
Printing to a file vs printing to shell in Perl

I am writing a Perl program to convert text typed in my local language (Tamil) with an ASCII-based font into Unicode characters.

This is my program:

#!/bin/perl
use strict;
use warnings;

use open ':std';
use open ':encoding(UTF-8)';

use Encode qw( encode decode );
use Data::Dump qw(dump);
use Getopt::Long qw(GetOptions);

Getopt::Long::Configure qw(gnu_getopt);

my $font;
my %map;
GetOptions(
    'font|f=s' => \$font,
    'help|h'   => \&usage,
) or die "Try $0 -h for help";

print "Do you want to map $font? (y/n)";
chomp( my $answer = lc <STDIN> );

$font = lc( $font );
$font =~ s/ /_/;
$font =~ s/(.*?)\.ttf/$1/;

if ( $answer eq "y" ) {
    map_font();
}
else {
    restore_map();
}

foreach ( @ARGV ) {

    my $modfile = "$_";

    $modfile =~ s/.*\/(.*)/uni$1/;

    process_file( $_, $modfile );
}

sub process_file {

    my @options = @_;

    open my $source, '<', "$options[0]";
    my $result = $options[1];
    my $test   = "./text";
    my $missingchar = join( "|", map( quotemeta, sort { length $b <=> length $a } keys %map ) );

    while ( <$source> ) {
        $/ = undef;
        s/h;/u;/g;       #Might need change based on the tamil font
        s/N(.)/$1N/g;    #Might need change based on the tamil font
        s/n(.)/$1n/g;    #Might need change based on the font
        s/($missingchar)/$map{$1}/g;

        print "$_";

        open my $final, '>:utf8', "$result";
        print $final "$_";
        close $final;
    }
}

sub map_font {

    my @oddhexes = qw/0B95 0B99 0B9A 0B9E 0B9F 0BA3 0BA4 0BA8 0BAA 0BAE 0BAF 0BB0 0BB2 0BB5 0BB3 0BB4 0BB1 0BA9/;
    my @missingletters = qw/0BC1 0BC2/;
    my @rest = qw/0B85 0B86 0B87 0B88 0B89 0B8A 0B8E 0B8F 0B90 0B92 0B93 0B83  0BBE  0BBF  0BC0  0BC6  0BC7  0BC8  0BCD  0B9C  0BB7  0BB8  0BB9 0BCB 0BCA 0BCC/;

    foreach ( @oddhexes ) {

        my $oddhex = $_;

        $_ = encode( 'utf8', chr( hex( $_ ) ) );
        print "Press the key for $_   :";
        chomp( my $bole = <STDIN> );
        if ( $bole eq "" ) {
            next;
        }

        $map{$bole} = $_;

        foreach ( @missingletters ) {

            my $oddchar = encode( 'utf8', chr( hex( $oddhex ) ) . chr( hex( $_ ) ) );

            print "Press the key for $oddchar   :";
            chomp( my $missingchar = <STDIN> );
            if ( $missingchar eq "" ) {
                next
            }

            $map{$missingchar} = $oddchar;
        }

    }

    foreach ( @rest ) {

        $_ = encode( 'utf8', chr( hex( $_ ) ) );

        print "Press the key for $_   :";
        chomp( my $misc = <STDIN> );
        if ( $misc eq "" ) {
            next
        }

        $map{$misc} = $_;
    }

    open my $OUTPUT, '>', $font || die "can't open file";
    print $OUTPUT dump( \%map );
    close $OUTPUT;
}

sub restore_map {

    open my $in, '<', "$font" || die "can't open file: $!";

    {
        local $/;
        %map = %{ eval <$in> };
    }

    close $in;
}

sub usage {
    print "\nUsage: $0 [options] {file1.txt file2.txt..} \neg: $0 -f TamilBible.ttf chapter.txt\n\nOptions:\n  -f --font - used to pass font name\n  -h --help - Prints help\n\nManual mapping of font is essential for using this program\n";
    exit;
}

In subroutine process_file, output of print "$_"; displays proper Tamil Unicode characters in the terminal.

However, the output to the file handle $final is very different.

The %map is here.

Why are the outputs different?

How can I correct this behaviour?

I have seen this question but this is not the same. In my case the terminal displays the result correctly while the filehandle output is different.

One Face asked Aug 11 '15 11:08


1 Answer

Your open statement

open my $final, '>:utf8', "$result";

sets your file handle to expect characters, and to encode them into UTF-8 byte sequences on the way out. But you are sending it pre-encoded byte sequences from the %map hash, so each of those bytes is treated as a character and encoded again by Perl's IO layer

In contrast, your terminal is set to expect UTF-8-encoded data, but STDOUT isn't set to do any encoding at all (use open ':std' has no effect on its own, see below), so it passes your UTF-8-encoded bytes through unchanged, which happens to be what the terminal expects
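As a rough illustration of that double encoding, here is a minimal sketch that uses a single Tamil letter (U+0B95, the first entry in your @oddhexes list). The variable names are mine, and the second call to encode only simulates what the :utf8 layer does to bytes it believes are characters

```perl
use strict;
use warnings;
use Encode qw( encode );

# One Tamil character (U+0B95, TAMIL LETTER KA) as a Perl character string
my $char = chr( hex('0B95') );

# What the question's %map stores: the character already encoded to UTF-8 bytes
my $bytes = encode( 'UTF-8', $char );

# Simulates a ':utf8' file handle receiving those bytes: each byte is taken
# to be a character in the range U+0000..U+00FF and is encoded a second time
my $twice = encode( 'UTF-8', $bytes );

printf "characters: %d, UTF-8 bytes: %d, double-encoded bytes: %d\n",
    length $char, length $bytes, length $twice;
```

The three bytes of the correctly encoded letter become six bytes after the second pass, which is why the file contents look nothing like what the terminal shows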

By the way, you have set a default open mode of :encoding(UTF-8) for input and output streams with

use open ':encoding(UTF-8)'

but have overridden it in your call to open. The :utf8 mode does a very basic translation from wide characters to byte sequences, but :encoding(UTF-8) is far more useful because it checks that each character being printed is a valid Unicode value. There is a good chance that it would have caught a mistake like this, and it would have been better to allow the default and write just

open my $final, '>', $result;

To keep things clean and tidy, your program should work in characters, and the file handles should be set to encode those characters to UTF-8 when they are printed

You can set UTF-8 as the default encoding for all newly-opened file handles as well as STDIN and STDOUT by adding

use open qw/ :std :encoding(utf-8) /;

to the top of your program (:encoding(utf-8) is preferable to :utf8) and removing all calls to encode. You had it almost right, but :std and :encoding(utf-8) need to be in the same use statement
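A minimal sketch of that character-based approach might look like this. It is not your full program: the key 'f' is a made-up keystroke, the map has a single entry, and the output goes to a temporary file created with File::Temp purely so the example can run anywhere

```perl
use strict;
use warnings;
use open qw/ :std :encoding(UTF-8) /;   # STDIN, STDOUT, STDERR and newly opened handles
use File::Temp qw( tempfile );

# Store characters in the map, not encoded bytes (note: no call to encode).
# The key 'f' is a hypothetical keystroke used for illustration only.
my %map = ( f => chr( hex('0B95') ) );

my $text = 'f';
$text =~ s/(f)/$map{$1}/g;

# Only the temporary file's name is needed; it is reopened below so that
# the default layer from "use open" applies to the handle
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# No explicit layer here, so the default :encoding(UTF-8) is used
open my $out_fh, '>', $path or die qq{Unable to open "$path" for output: $!};
print $out_fh $text;
close $out_fh or die qq{Unable to close "$path": $!};

print $text, "\n";    # STDOUT is encoded by the same layer

my $size = -s $path;
print "bytes on disk: $size\n";
```

The character is encoded exactly once on its way to both the terminal and the file, so the file holds the three UTF-8 bytes of the letter rather than six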

You should also add

use utf8;

at the very top so that you can use UTF-8 characters in the program itself
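For example, assuming the source file is itself saved as UTF-8, something along these lines should treat a Tamil literal as a single character. The variable name $ka is just for illustration

```perl
use strict;
use warnings;
use utf8;                                # the source file itself is UTF-8
use open qw/ :std :encoding(UTF-8) /;

my $ka = 'க';                            # U+0B95 typed directly in the source
printf "length %d, code point U+%04X\n", length $ka, ord $ka;
```

Without use utf8 the same literal would be seen as three separate bytes, and you would be back to handling encoded data by hand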

You also have a few incidental errors. For instance

  • In the statement

    open my $in, '<', "$font" || die "can't open file: $!";
    

    it is almost always wrong to quote a single scalar variable like $font unless it happens to be an object and you want to invoke the stringification method

    You need or instead of ||, otherwise you're just testing the truth of $font

    If I asked you what a variable called $in might contain I think you'd be hesitant; $in_fh is better and is a common idiom

    It's always nice to put the name of the file into the die string as well as the reason from $!

    Taking all of those into account makes your statement look like this

    open my $in_fh, '<', $font or die qq{Unable to open "$font" for input: $!};
    
  • You should be consistent between upper and lower case scalar variables, and lower case is the conventional choice for lexical variables. So

    open my $OUTPUT, '>', $font || die "can't open file";
    

    should be something like

    open my $out_fh, '>', $font or die qq{Unable to open "$font" for output: $!};
    
  • The line

    $/ = undef;
    

    should be local $/ as you have used elsewhere, otherwise you are permanently modifying the input record separator for the rest of your program and modules. It also appears after the first read from the file handle, so your program will read and process one line, and then the whole of the rest of the file in the next iteration of the while loop
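Putting those points together, reading a whole file in one go could be sketched like this. The helper name slurp and the throwaway two-line test file are my own inventions and not part of your program

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# slurp() is a hypothetical helper name, not part of the original program
sub slurp {
    my ($path) = @_;

    # "or" binds more loosely than the arguments to open, so die really
    # does depend on whether open succeeded
    open my $in_fh, '<:encoding(UTF-8)', $path
        or die qq{Unable to open "$path" for input: $!};

    local $/;                  # undefined only until this subroutine returns
    my $content = <$in_fh>;    # a single read returns the whole file
    close $in_fh;

    return $content;
}

# Small self-contained check using a throwaway two-line file
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
print $tmp_fh "line one\nline two\n";
close $tmp_fh or die qq{Unable to close "$path": $!};

my $content = slurp($path);
my $lines   = () = $content =~ /\n/g;    # count the newlines

print "read $lines lines in one go\n";
print defined $/ ? "separator restored\n" : "separator still undef\n";
```

Because local $/ is set before the first read and is scoped to the subroutine, the whole file arrives in a single read and the rest of the program still sees the usual newline separator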

Borodin answered Nov 05 '22 09:11