Although this is pretty basic, I can't find a similar question, so please link to one if you know of an existing question/solution on SO.
I have a .txt file that is about 2 MB and about 16,000 lines long. Each record is 160 characters long with a blocking factor of 10. This is an older type of data structure that almost looks like a tab-delimited file, but the fields are separated by single characters/whitespace.
First, I glob a directory for .txt files - there is never more than one file in the directory at a time, so this attempt may be inefficient in itself.
my $txt_file = glob "/some/cheese/dir/*.txt";
Then I open the file with this line:
open (F, $txt_file) || die ("Could not open $txt_file");
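(Aside: I gather the three-argument open with a lexical filehandle, plus $! in the error message, is generally preferred, so the equivalent would be roughly:)
my ($txt_file) = glob "/some/cheese/dir/*.txt";   # list context: take the first match
open my $fh, '<', $txt_file
    or die "Could not open $txt_file: $!";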
As per the data dictionary for this file, I'm parsing each "field" out of each line using Perl's substr() function within a while loop.
while ($line = <F>)
{
    $nom_stat = substr($line,0,1);
    $lname = substr($line,1,15);
    $fname = substr($line,16,15);
    $mname = substr($line,31,1);
    $address = substr($line,32,30);
    $city = substr($line,62,20);
    $st = substr($line,82,2);
    $zip = substr($line,84,5);
    $lnum = substr($line,93,9);
    $cl_rank = substr($line,108,4);
    $ceeb = substr($line,112,6);
    $county = substr($line,118,2);
    $sex = substr($line,120,1);
    $grant_type = substr($line,121,1);
    $int_major = substr($line,122,3);
    $acad_idx = substr($line,125,3);
    $gpa = substr($line,128,5);
    $hs_cl_size = substr($line,135,4);
}
Can anyone suggest a more efficient/preferred method?
It looks to me like you are working with fixed-width fields here. Is that true? If it is, the unpack function is what you need. You provide a template describing the fields and it will extract the info from those fields. There is a tutorial available (perlpacktut), and the template information is found in the documentation for pack, which is unpack's logical inverse. As a basic example, simply:
my @values = unpack("A1 A15 A15 ...", $line);
where 'A' means ASCII text (unpack also strips trailing spaces from A fields) and the number is the field width. There is quite an art to unpack as some people use it, but I believe this will suffice for basic use.
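Going by the offsets in the question's substr calls, the record appears to have unused gap columns at 89-92, 102-107 and 133-134, so a full template for that layout might look roughly like this (the x counts skip those gaps; the widths are just my reading of the offsets, so check them against the data dictionary):
my ($nom_stat, $lname, $fname, $mname, $address, $city, $st, $zip,
    $lnum, $cl_rank, $ceeb, $county, $sex, $grant_type, $int_major,
    $acad_idx, $gpa, $hs_cl_size)
    = unpack "A1 A15 A15 A1 A30 A20 A2 A5 x4 A9 x6 A4 A6 A2 A1 A1 A3 A3 A5 x2 A4", $line;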
A single regular expression, compiled and cached using the /o option, is the fastest approach. I ran your code three ways using the Benchmark module and came out with:
           Rate unpack substr regexp
unpack   2.59/s     --   -59%   -67%
substr   6.23/s   141%     --   -21%
regexp   7.90/s   206%    27%     --
Input was a file with 20k lines; each line had the same 160 characters on it (16 repetitions of the characters 0123456789). So it's the same input size as the data you're working with.
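For reference, a test file like that can be generated with a one-liner along these lines (data.txt is the file name the benchmark script below reads):
perl -e 'print "0123456789" x 16, "\n" for 1 .. 20000' > data.txt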
The Benchmark::cmpthese() function lists the subroutines from slowest to fastest. The first column tells us how many times per second each subroutine can be run. The regular expression approach is the fastest, not unpack as I stated previously. Sorry about that.
The benchmark code is below. The print statements are there as sanity checks. This was with Perl 5.10.0 built for darwin-thread-multi-2level.
#!/usr/bin/env perl
use Benchmark qw(:all);
use strict;
use warnings;
sub use_substr() {
    print "use_substr(): New iteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        my($nom_stat,
           $lname,
           $fname,
           $mname,
           $address,
           $city,
           $st,
           $zip,
           $lnum,
           $cl_rank,
           $ceeb,
           $county,
           $sex,
           $grant_type,
           $int_major,
           $acad_idx,
           $gpa,
           $hs_cl_size) = (substr($line,0,1),
                           substr($line,1,15),
                           substr($line,16,15),
                           substr($line,31,1),
                           substr($line,32,30),
                           substr($line,62,20),
                           substr($line,82,2),
                           substr($line,84,5),
                           substr($line,93,9),
                           substr($line,108,4),
                           substr($line,112,6),
                           substr($line,118,2),
                           substr($line,120,1),
                           substr($line,121,1),
                           substr($line,122,3),
                           substr($line,125,3),
                           substr($line,128,5),
                           substr($line,135,4));
        #print "use_substr(): \$lname = $lname\n";
        #print "use_substr(): \$gpa = $gpa\n";
    }
    close(F);
    return 1;
}
sub use_regexp() {
    print "use_regexp(): New iteration\n";
    # The (?:...) groups skip the unused gap columns (89-92, 102-107, 133-134)
    # so the captures line up with the substr offsets above.
    my $pattern = '^(.{1})(.{15})(.{15})(.{1})(.{30})(.{20})(.{2})(.{5})(?:.{4})(.{9})(?:.{6})(.{4})(.{6})(.{2})(.{1})(.{1})(.{3})(.{3})(.{5})(?:.{2})(.{4})';
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        if ( $line =~ m/$pattern/o ) {
            my($nom_stat,
               $lname,
               $fname,
               $mname,
               $address,
               $city,
               $st,
               $zip,
               $lnum,
               $cl_rank,
               $ceeb,
               $county,
               $sex,
               $grant_type,
               $int_major,
               $acad_idx,
               $gpa,
               $hs_cl_size) = ( $1,
                                $2,
                                $3,
                                $4,
                                $5,
                                $6,
                                $7,
                                $8,
                                $9,
                                $10,
                                $11,
                                $12,
                                $13,
                                $14,
                                $15,
                                $16,
                                $17,
                                $18);
            #print "use_regexp(): \$lname = $lname\n";
            #print "use_regexp(): \$gpa = $gpa\n";
        }
    }
    close(F);
    return 1;
}
sub use_unpack() {
    print "use_unpack(): New iteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        # The x4, x6 and x2 entries skip the unused gap columns so the
        # fields line up with the substr offsets above.
        my($nom_stat,
           $lname,
           $fname,
           $mname,
           $address,
           $city,
           $st,
           $zip,
           $lnum,
           $cl_rank,
           $ceeb,
           $county,
           $sex,
           $grant_type,
           $int_major,
           $acad_idx,
           $gpa,
           $hs_cl_size) = unpack(
            "A1 A15 A15 A1 A30 A20 A2 A5 x4 A9 x6 A4 A6 A2 A1 A1 A3 A3 A5 x2 A4", $line
        );
        #print "use_unpack(): \$lname = $lname\n";
        #print "use_unpack(): \$gpa = $gpa\n";
    }
    close(F);
    return 1;
}
# Benchmark it
my $itt = 50;
cmpthese($itt, {
    'substr' => sub { use_substr(); },
    'regexp' => sub { use_regexp(); },
    'unpack' => sub { use_unpack(); },
});
exit(0);
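Outside of the benchmark, it may be tidier to pair the template with a list of field names and load each record into a hash, so the layout is defined in one place. A sketch along those lines (field names from the question; the template, including the x gap-skips, is my reading of the substr offsets):
#!/usr/bin/env perl
use strict;
use warnings;

my @fields = qw(nom_stat lname fname mname address city st zip lnum cl_rank
                ceeb county sex grant_type int_major acad_idx gpa hs_cl_size);
my $template = "A1 A15 A15 A1 A30 A20 A2 A5 x4 A9 x6 A4 A6 A2 A1 A1 A3 A3 A5 x2 A4";

my ($txt_file) = glob "/some/cheese/dir/*.txt";
open my $fh, '<', $txt_file or die "Could not open $txt_file: $!";
while (my $line = <$fh>) {
    my %record;
    @record{@fields} = unpack $template, $line;   # hash slice: one value per field name
    # ... do something with $record{lname}, $record{gpa}, etc.
}
close $fh;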