I'm playing around with study, a Perl feature to examine a string to make subsequent regular expressions potentially much speedier:
while( <> ) {
study;
$count++ if /PATTERN/;
$count++ if /OTHER/;
$count++ if /PATTERN2/;
}
There's not much said about which situations will benefit from this. A few things you can tease out of the docs:
I'm looking for concrete cases where I can not only demonstrate a big advantage, but also tweak the case slightly to lose that advantage. One of the warnings in the docs is that you should benchmark individual cases. I want to find some of the edge cases where a small difference in a string (or pattern) makes a big difference in performance.
If you haven't used study, please don't answer. I'd rather have well-formed, correct answers instead of fast guesses. There's no urgency here, and this isn't holding up any work.
And, as a bonus, I've been playing with a benchmarking tool comparing two NYTProf runs, which I'd rather use than the usual benchmarking tool. If I come up with a way to automate that, I'll share that too.
Google turned up this lovely test scenario:
#!/usr/bin/perl
#
# Exercise 7.8
#
# This is a more difficult exercise. The study function in Perl may speed up searches
# for motifs in DNA or protein. Read the Perl documentation on this function. Its use
# is simple: given some sequence data in a variable $sequence, type:
#
# study $sequence;
#
# before doing the searches. Do you think study will speed up searches in DNA or
# protein, based on what you've read about it in the documentation?
#
# For lots of extra credit! Now read the Perl documentation on the standard module
# Benchmark. (Type perldoc Benchmark, or visit the Perl home page at http://www.
# perl.com.) See if your guess is right by writing a program that benchmarks motif
# searches of DNA and of protein, with and without study.
#
# Answer to Exercise 7.8
use strict;
use warnings;
use Benchmark;
my $dna = join ('', qw(
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta
));
my $protein = join('', qw(
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
));
my $count = 1000;
print "DNA pattern matches without 'study' function:\n";
timethis($count,
' for(my $i=1 ; $i < 10000; ++$i) {
$dna =~ /aggtc/;
$dna =~ /aatggccgt/;
$dna =~ /gatcgatcagctagcat/;
$dna =~ /gtatgaac/;
$dna =~ /[ac][cg][gt][ta]/;
$dna =~ /ccccccccc/;
} '
);
print "\nDNA pattern matches with 'study' function:\n";
timethis($count,
' study $dna;
for(my $i=1 ; $i < 10000; ++$i) {
$dna =~ /aggtc/;
$dna =~ /aatggccgt/;
$dna =~ /gatcgatcagctagcat/;
$dna =~ /gtatgaac/;
$dna =~ /[ac][cg][gt][ta]/;
$dna =~ /ccccccccc/;
} '
);
print "\nProtein pattern matches without 'study' function:\n";
timethis($count,
' for(my $i=1 ; $i < 10000; ++$i) {
$protein =~ /PH.EI/;
$protein =~ /KFTEQGESMRLY/;
$protein =~ /[YAL][NVP][ISV][KQE]/;
$protein =~ /DKKQIR/;
$protein =~ /[MD][VT][HQ][ER]/;
$protein =~ /NVPISVKQEITFTDVSEQL/;
} '
);
print "\nProtein pattern matches with 'study' function:\n";
timethis($count,
' study $protein;
for(my $i=1 ; $i < 10000; ++$i) {
$protein =~ /PH.EI/;
$protein =~ /KFTEQGESMRLY/;
$protein =~ /[YAL][NVP][ISV][KQE]/;
$protein =~ /DKKQIR/;
$protein =~ /[MD][VT][HQ][ER]/;
$protein =~ /NVPISVKQEITFTDVSEQL/;
} '
);
Note that the reported gain is only about 2% in the best case (the protein matches):
# $ perl exer07.08
# On my computer, this is the output I get: your results probably vary.
# DNA pattern matches without 'study' function:
# timethis 1000: 29 wallclock secs (29.25 usr + 0.00 sys = 29.25 CPU) @ 34.19/s (n=1000)
#
# DNA pattern matches with 'study' function:
# timethis 1000: 30 wallclock secs (29.21 usr + 0.15 sys = 29.36 CPU) @ 34.06/s (n=1000)
#
# Protein pattern matches without 'study' function:
# timethis 1000: 32 wallclock secs (29.47 usr + 0.04 sys = 29.51 CPU) @ 33.89/s (n=1000)
#
# Protein pattern matches with 'study' function:
# timethis 1000: 30 wallclock secs (28.97 usr + 0.02 sys = 28.99 CPU) @ 34.49/s (n=1000)
#
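One thing worth checking before reading much into those numbers: the script passes its code to timethis as a string, and Benchmark compiles that string itself, so file-scoped lexicals such as my $dna may not be visible to it. A minimal sketch of the same comparison using code references (closures) follows. The sample text, the pattern list, and the count_matches helper are all hypothetical stand-ins, not part of the exercise, and on Perl 5.16 or later the two rows should come out about the same.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data; substitute the $dna or $protein string from the exercise.
my $text = 'agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg' x 20;
my @patterns = (qr/aggtc/, qr/ggccgg/, qr/[ac][cg][gt][ta]/, qr/ccccccccc/);

# Count how many patterns match, optionally calling study first.
# The string is copied into a lexical on each call, so study
# is applied to that copy rather than to the caller's variable.
sub count_matches {
    my ($string, $use_study) = @_;
    study $string if $use_study;
    my $hits = 0;
    for my $re (@patterns) {
        $hits++ if $string =~ $re;
    }
    return $hits;
}

# A negative count tells Benchmark to run each case for at least that many CPU seconds.
cmpthese(-1, {
    plain   => sub { count_matches($text, 0) },
    studied => sub { count_matches($text, 1) },
});
```

Because the closures capture $text and @patterns directly, there is no question about which variables the timed code sees, at the cost of a small per-call overhead from the sub call and the string copy.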
I'm going to leave notes as an answer, and later I'll develop it into an actual answer:
In pp.c's PP(pp_study), it has these curious lines (minus a comment):
if (len == 0 || len > I32_MAX || !SvPOK(sv) || SvUTF8(sv) || SvVALID(sv)) {
RETPUSHNO;
}
It looks like scalars with the UTF8 flag set aren't studied at all.
Not really. If you search for uses of study, most of the results are in the Perl test suite, which suggests that almost nobody uses it. Also, because of a bug, you could only notice a speed benefit on global variables. It actually did bring some speedups when dealing with English text (sometimes even 2 times faster), but you had to make the variable global.
It also sometimes caused infinite loops or false positives (study could add bugs to your program, even though it was only supposed to make it faster), and because of that it was removed (or rather, made a no-op) in Perl 5.16. Nobody wanted to maintain a part nobody cared about anyway.
None. Since 2012, study does nothing.
Currently the code has
if (len == 0 || len > I32_MAX || !SvPOK(sv) || SvUTF8(sv) || SvVALID(sv)) {
/* Historically, study was skipped in these cases. */
SETs(&PL_sv_no);
return NORMAL;
}
/* Make study a no-op. It's no longer useful and its existence
complicates matters elsewhere. */
SETs(&PL_sv_yes);
return NORMAL;
which means that study returns true in the case where it would formerly have done something, and false otherwise, but it never actually does anything.
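The return value is easy to poke at from Perl. This is only a sketch, assuming a Perl 5.16 or later built from the code quoted above; study_result is a hypothetical helper name, and the three sample strings are my own.

```perl
use strict;
use warnings;

# Report study's return value as 1 (true) or 0 (false) for a string.
sub study_result {
    my ($string) = @_;
    return study($string) ? 1 : 0;
}

my $plain = 'some ordinary text';   # non-empty, no UTF8 flag
my $empty = '';                     # hits the len == 0 check
my $wide  = "caf\x{e9}";
utf8::upgrade($wide);               # force the UTF8 flag on, hits the SvUTF8(sv) check

printf "plain: %d\n", study_result($plain);
printf "empty: %d\n", study_result($empty);
printf "utf8:  %d\n", study_result($wide);
```

Going by the C code, I'd expect true for the plain string and false for the empty and UTF8-flagged ones, with no effect on later matches either way.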