I am trying to improve the warning message issued by Encode::decode()
. Instead of printing the name of the module and the line number in the module, I would like it to print the name of the file being read and the line number in that file where the malformed data was found. To a developer, the origial message can be useful, but to an end user not familiar with Perl, it is probably quite meaningless. The end user would probably rather like to know which file is giving the problem.
I first tried to solve this using a $SIG{__WARN__}
handler (which is probably not a good idea), but I get a segfault. Probably a silly mistake, but I could not figure it out:
#! /usr/bin/env perl
use feature qw(say);
use strict;
use warnings;
use Encode ();
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $fn = 'test.txt';
write_test_file( $fn );
# Try to improve the Encode::FB_WARN fallback warning message :
#
# utf8 "\xE5" does not map to Unicode at <module_name> line xx
#
# Rather we would like the warning to print the filename and the line number:
#
# utf8 "\xE5" does not map to Unicode at line xx of file <filename>.
my $str = '';
open ( my $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
{
local $SIG{__WARN__} = sub { my_warn_handler( $fn, $_[0] ) };
$str = do { local $/; <$fh> };
}
close $fh;
say "Read string: '$str'";
sub my_warn_handler {
my ( $fn, $msg ) = @_;
if ( $msg =~ /\Qdoes not map to Unicode\E/ ) {
recover_line_number_and_char_pos( $fn, $msg );
}
else {
warn $msg;
}
}
sub recover_line_number_and_char_pos {
my ( $fn, $err_msg ) = @_;
chomp $err_msg;
$err_msg =~ s/(line \d+)\.$/$1/; # Remove period at end of sentence.
open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
my $raw_data = do { local $/; <$fh> };
close $fh;
my $str = Encode::decode( 'utf-8', $raw_data, Encode::FB_QUIET );
my ($header, $last_line) = $str =~ /^(.*\n)([^\n]*)$/s;
my $line_no = $str =~ tr/\n//;
++$line_no;
my $pos = ( length $last_line ) + 1;
warn "$err_msg, in file '$fn' (line: $line_no, pos: $pos)\n";
}
sub write_test_file {
my ( $fn ) = @_;
my $bytes = "Hello\nA\x{E5}\x{61}"; # 2 lines ending in iso 8859-1: åa
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
}
Output:
utf8 "\xE5" does not map to Unicode at ./p.pl line 27
, in file 'test.txt' (line: 2, pos: 2)
Segmentation fault (core dumped)
Here is another way to locate where the warning fires, with un-buffered sysread
use warnings;
use strict;
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $file = 'test.txt';
open my $fh, "<:encoding(UTF-8)", $file or die "Can't open $file: $!";
$SIG{__WARN__} = sub { print "\t==> WARN: @_" };
my $char_cnt = 0;
my $char;
while (sysread($fh, $char, 1)) {
++$char_cnt;
print "$char ($char_cnt)\n";
}
The file test.txt
was written by the posted program, except that I had to add to it to reproduce the behavior -- it runs without warnings on v5.10 and v5.16. I added \x{234234}
to the end. The line number can be tracked with $char =~ /\n/
.
The sysread
returns undef
on error. It can be moved into the body of while (1)
to allow reads to continue and catch all warnings, breaking out on 0
(returned on EOF).
This prints
H (1) e (2) l (3) l (4) o (5) (6) A (7) å (8) a (9) ==> WARN: Code point 0x234234 is not Unicode, may not be portable at ... (10)
While this does catch the character warned about, re-reading the file using Encode
may well be better than reaching for sysread
, in particular if sysread
uses Encode
.
However, Perl is utf8
internally and I am not sure that sysread
needs Encode
.
Note. The page for sysread supports its use on data with encoding layers
Note that if the filehandle has been marked as
:utf8
Unicode characters are read instead of bytes (the LENGTH, OFFSET, and the return value ofsysread
are in Unicode characters). The:encoding(...)
layer implicitly introduces the:utf8
layer. Seebinmode
,open
, and theopen
pragma.
Note Apparently, things have moved on and after a certain version sysread
does not support encoding layers. The link above, while for an older version (v5.10 for one) indeed shows what is quoted, with a newer version tells us that there'll be an exception.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With