Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trying to improve Encode::decode warning message: Segfault in $SIG{__WARN__} handler

Tags:

perl

I am trying to improve the warning message issued by Encode::decode(). Instead of printing the name of the module and the line number in the module, I would like it to print the name of the file being read and the line number in that file where the malformed data was found. To a developer, the origial message can be useful, but to an end user not familiar with Perl, it is probably quite meaningless. The end user would probably rather like to know which file is giving the problem.

I first tried to solve this using a $SIG{__WARN__} handler (which is probably not a good idea), but I get a segfault. Probably a silly mistake, but I could not figure it out:

#! /usr/bin/env perl

use feature qw(say);
use strict;
use warnings;

use Encode ();

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $fn = 'test.txt';
write_test_file( $fn );

# Try to improve the Encode::FB_WARN fallback warning message :
#
#   utf8 "\xE5" does not map to Unicode at <module_name> line xx
#
# Rather we would like the warning to print the filename and the line number:
#
#   utf8 "\xE5" does not map to Unicode at line xx of file <filename>.

my $str = '';
open ( my $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
{
    local $SIG{__WARN__} = sub { my_warn_handler( $fn, $_[0] ) }; 
    $str = do { local $/; <$fh> };
}
close $fh;
say "Read string: '$str'";

sub my_warn_handler {
    my ( $fn, $msg ) = @_;

    if ( $msg =~ /\Qdoes not map to Unicode\E/ ) {
        recover_line_number_and_char_pos( $fn, $msg );
    }
    else {
        warn $msg;
    }
}

sub recover_line_number_and_char_pos {
    my ( $fn, $err_msg ) = @_;

    chomp $err_msg;
    $err_msg =~ s/(line \d+)\.$/$1/;  # Remove period at end of sentence.
    open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
    my $raw_data = do { local $/; <$fh> };
    close $fh;
    my $str = Encode::decode( 'utf-8', $raw_data, Encode::FB_QUIET );
    my ($header, $last_line) = $str =~ /^(.*\n)([^\n]*)$/s; 
    my $line_no = $str =~ tr/\n//;
    ++$line_no;
    my $pos = ( length $last_line ) + 1;
    warn "$err_msg, in file '$fn' (line: $line_no, pos: $pos)\n";
}

sub write_test_file {
    my ( $fn ) = @_;

    my $bytes = "Hello\nA\x{E5}\x{61}";  # 2 lines ending in iso 8859-1: åa
    open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
    print $fh $bytes;
    close $fh;
}

Output:

utf8 "\xE5" does not map to Unicode at ./p.pl line 27
, in file 'test.txt' (line: 2, pos: 2)
Segmentation fault (core dumped)
like image 614
Håkon Hægland Avatar asked Dec 28 '16 11:12

Håkon Hægland


1 Answers

Here is another way to locate where the warning fires, with un-buffered sysread

use warnings;
use strict;

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $file = 'test.txt';
open my $fh, "<:encoding(UTF-8)", $file or die "Can't open $file: $!";

$SIG{__WARN__} = sub { print "\t==> WARN: @_" };

my $char_cnt = 0;    
my $char;

while (sysread($fh, $char, 1)) {
    ++$char_cnt;
    print "$char ($char_cnt)\n";
}

The file test.txt was written by the posted program, except that I had to add to it to reproduce the behavior -- it runs without warnings on v5.10 and v5.16. I added \x{234234} to the end. The line number can be tracked with $char =~ /\n/.

The sysread returns undef on error. It can be moved into the body of while (1) to allow reads to continue and catch all warnings, breaking out on 0 (returned on EOF).

This prints

H (1)
e (2)
l (3)
l (4)
o (5)

 (6)
A (7)
å (8)
a (9)
        ==> WARN: Code point 0x234234 is not Unicode, may not be portable at ...
 (10)

While this does catch the character warned about, re-reading the file using Encode may well be better than reaching for sysread, in particular if sysread uses Encode.

However, Perl is utf8 internally and I am not sure that sysread needs Encode.

Note. The page for sysread supports its use on data with encoding layers

Note that if the filehandle has been marked as :utf8 Unicode characters are read instead of bytes (the LENGTH, OFFSET, and the return value of sysread are in Unicode characters). The :encoding(...) layer implicitly introduces the :utf8 layer. See binmode, open, and the open pragma.


Note   Apparently, things have moved on and after a certain version sysread does not support encoding layers. The link above, while for an older version (v5.10 for one) indeed shows what is quoted, with a newer version tells us that there'll be an exception.

like image 88
zdim Avatar answered Dec 12 '22 13:12

zdim