I'm extremely new to Perl (and to programming, for that matter), so I'm sorry if this is just a stupid mistake.
I'm trying to write a script that pulls a list of files from a .txt file, opens each one, looks for lines that match some regex, and prints those lines to a new file in a structure that will make a valid .csv file (using the capture groups in the regex).
My script works for English UTF-8 files, but when it tries to process non-English files the text appears with spaces between each letter and the regex doesn't match. I'm guessing this is because those files are saved as UTF-16. My thinking was to use the three-argument form of open so I could add the ":encoding(UTF-16)" layer for non-English files, but that results in an "invalid argument" error. In fact, I can't get the script to run at all unless I use the two-argument form of open.
Here's my script.
use 5.010;
use strict;
use warnings;
use File::Slurp;

my @intfilelist = read_file('filelist_int.txt');
unlink "int_temp.csv";

foreach my $intfile (@intfilelist) {
    open (my $file, "<:encoding(UTF-16)", $intfile) or die "Whoops! $!";
    while (my $line = <$file>) {
        if ($line =~ m/^(\d{3,5})\t(.*)$/) {
            chomp $line;
            open (my $csv, ">>", "int_temp.csv");
            print $csv ("\"$intfile\",\"$1\",\"$2\"\n");
            close $csv;
        }
    }
}
Changing open (my $file, "<:encoding(UTF-16)", $intfile) to open (my $file, $intfile) makes the script run, except for the aforementioned issues with non-English files.
Like I said, I've only been playing with Perl for two days, so sorry if I've misused some terminology or overlooked something obvious. I appreciate any help!
Remove the newline at the end of each filename that you read from the first file with File::Slurp. You can do this with chomp $intfile; right before the open. chomp (see perldoc -f chomp) removes a trailing newline from the end of a string, so open then receives a clean filename instead of one ending in "\n" (which is what triggers the "invalid argument" error).
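For example, here is a minimal sketch of the loop with that one-line fix applied; everything else stays as in the posted script, and the body of the while loop is omitted here:

use 5.010;
use strict;
use warnings;
use File::Slurp;

my @intfilelist = read_file('filelist_int.txt');

foreach my $intfile (@intfilelist) {
    chomp $intfile;    # strip the trailing newline left by read_file
    open (my $file, "<:encoding(UTF-16)", $intfile) or die "Whoops! $!";
    # ... process $file line by line as before ...
    close $file;
}

If your version of File::Slurp supports it, passing the chomp option, as in read_file('filelist_int.txt', chomp => 1), strips the newlines at read time so you don't need the explicit chomp in the loop.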