
Which situations benefit from Perl's study?

Tags:

perl

I'm playing around with study, a Perl built-in that examines a string ahead of time so that subsequent regular expression matches against it can potentially run much faster:

while( <> ) {
    study;
    $count++ if /PATTERN/;
    $count++ if /OTHER/;
    $count++ if /PATTERN2/;
    }

There's not much said about which situations will benefit from this. A few things you can tease out of the docs:

  • Patterns with constant strings
  • Multiple patterns
  • Shorter target strings might be better (takes less time to study)

I'm looking for concrete cases where I not only can demonstrate a big advantage, but also cases that I can slightly tweak to lose that advantage. One of the warnings in the docs is that you should benchmark individual cases. I want to find some of the edge cases where a small difference in a string (or pattern) makes a big difference in performance.
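
A minimal sketch of the kind of comparison described above, assuming made-up sample lines and patterns (`@lines`, `@patterns`, and `count_matches` are illustrative names, not from any real workload). It passes code references to Benchmark's `timethese` so both variants run the same matching loop. Keep in mind that perl5160delta documents study as a no-op from Perl 5.16 on, so a difference can only show up on older perls.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical sample data and patterns; substitute real input.
my @lines    = map { "line $_: the quick brown fox jumps over the lazy dog" } 1 .. 200;
my @patterns = (qr/quick/, qr/lazy dog/, qr/no such text/);

# Count matching (line, pattern) pairs, optionally studying each line first.
sub count_matches {
    my ($use_study) = @_;
    my $count = 0;
    for my $line (@lines) {
        my $target = $line;            # fresh copy so each run starts unstudied
        study $target if $use_study;
        for my $re (@patterns) {
            $count++ if $target =~ $re;
        }
    }
    return $count;
}

# Code references close over the lexicals above. 500 iterations keeps the
# sketch quick; raise it (or pass -5 for five CPU seconds) for real numbers.
timethese(500, {
    plain   => sub { count_matches(0) },
    studied => sub { count_matches(1) },
});
```

To hunt for edge cases, change one thing at a time (line length, number of patterns, constant versus character-class patterns) and rerun.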

If you haven't used study, please don't answer. I'd rather have well-formed, correct answers instead of fast guesses. There's no urgency here, and this isn't holding up any work.

And, as a bonus, I've been playing with a benchmarking tool comparing two NYTProf runs, which I'd rather use than the usual benchmarking tool. If I come up with a way to automate that, I'll share that too.

brian d foy asked Dec 05 '11 09:12


4 Answers

Google turned up this lovely test scenario:

#!/usr/bin/perl
# 
#  Exercise 7.8 
# 
# This is a more difficult exercise. The study function in Perl may speed up searches 
# for motifs in DNA or protein. Read the Perl documentation on this function. Its use 
# is simple: given some sequence data in a variable $sequence, type:
# 
# study $sequence;
# 
# before doing the searches. Do you think study will speed up searches in DNA or 
# protein, based on what you've read about it in the documentation?
# 
# For lots of extra credit! Now read the Perl documentation on the standard module 
# Benchmark. (Type perldoc Benchmark, or visit the Perl home page at
# http://www.perl.com.) See if your guess is right by writing a program that
# benchmarks motif searches of DNA and of protein, with and without study.
#
# Answer to Exercise 7.8

use strict;
use warnings;

use Benchmark;

my $dna = join ('', qw(
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta
));

my $protein = join('', qw(
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
));

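# Benchmark compiles the string snippets below in package main but outside this
# file's lexical scope, so they cannot see the `my` variables above. Copy the
# targets into package variables so the snippets really search the sequences.
{
    no warnings 'once';
    ($main::dna, $main::protein) = ($dna, $protein);
}

my $count = 1000;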

print "DNA pattern matches without 'study' function:\n";
timethis($count,
    ' for(my $i=1 ; $i < 10000; ++$i) {
        $dna =~ /aggtc/;
        $dna =~ /aatggccgt/;
        $dna =~ /gatcgatcagctagcat/;
        $dna =~ /gtatgaac/;
        $dna =~ /[ac][cg][gt][ta]/;
        $dna =~ /ccccccccc/;
    } '
);

print "\nDNA pattern matches with 'study' function:\n";
timethis($count,
    ' study $dna;
    for(my $i=1 ; $i < 10000; ++$i) {
        $dna =~ /aggtc/;
        $dna =~ /aatggccgt/;
        $dna =~ /gatcgatcagctagcat/;
        $dna =~ /gtatgaac/;
        $dna =~ /[ac][cg][gt][ta]/;
        $dna =~ /ccccccccc/;
    } '
);

print "\nProtein pattern matches without 'study' function:\n";
timethis($count,
    ' for(my $i=1 ; $i < 10000; ++$i) {
        $protein =~ /PH.EI/;
        $protein =~ /KFTEQGESMRLY/;
        $protein =~ /[YAL][NVP][ISV][KQE]/;
        $protein =~ /DKKQIR/;
        $protein =~ /[MD][VT][HQ][ER]/;
        $protein =~ /NVPISVKQEITFTDVSEQL/;
    } '
);

print "\nProtein pattern matches with 'study' function:\n";
timethis($count,
    ' study $protein;
    for(my $i=1 ; $i < 10000; ++$i) {
        $protein =~ /PH.EI/;
        $protein =~ /KFTEQGESMRLY/;
        $protein =~ /[YAL][NVP][ISV][KQE]/;
        $protein =~ /DKKQIR/;
        $protein =~ /[MD][VT][HQ][ER]/;
        $protein =~ /NVPISVKQEITFTDVSEQL/;
    } '
);

Note that the reported gain is only about 2% for the most favorable case (protein matches), and the DNA case is marginally slower with study:

#  $ perl exer07.08
# On my computer, this is the output I get: your results probably vary.

#  DNA pattern matches without 'study' function:
#  timethis 1000: 29 wallclock secs (29.25 usr +  0.00 sys = 29.25 CPU) @ 34.19/s (n=1000)
#  
#  DNA pattern matches with 'study' function:
#  timethis 1000: 30 wallclock secs (29.21 usr +  0.15 sys = 29.36 CPU) @ 34.06/s (n=1000)
#  
#  Protein pattern matches without 'study' function:
#  timethis 1000: 32 wallclock secs (29.47 usr +  0.04 sys = 29.51 CPU) @ 33.89/s (n=1000)
#  
#  Protein pattern matches with 'study' function:
#  timethis 1000: 30 wallclock secs (28.97 usr +  0.02 sys = 28.99 CPU) @ 34.49/s (n=1000)
#  
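
One thing to check before trusting numbers from string snippets: Benchmark compiles a string of code in the caller's package but not in the caller's lexical scope, so the snippet can see package variables and cannot see file-level `my` variables. The sketch below illustrates this. The variable names and the `%seen` hash are made up for the demonstration.

```perl
use strict;
use warnings;
use Benchmark qw(timeit);

my  $lexical     = 'set';   # file-scoped lexical
our $package_var = 'set';   # package variable, visible as $main::package_var
our %seen;                  # package hash the snippet can write into

# A string snippet is compiled inside Benchmark.pm, in package main but
# outside this file's lexical scope, so $lexical there means $main::lexical.
timeit(1, q{
    $seen{lexical}     = defined $lexical     ? 1 : 0;
    $seen{package_var} = defined $package_var ? 1 : 0;
});

# A code reference is compiled here and closes over the lexical.
timeit(1, sub { $seen{closure} = defined $lexical ? 1 : 0 });

print "$_ => $seen{$_}\n" for sort keys %seen;
```

If the `lexical` entry comes out 0, a string snippet that matches against a `my` variable is really matching against an undefined package variable, which would make studied and unstudied timings look alike regardless of what study does.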
sehe answered Nov 15 '22 01:11


I'm going to leave notes as an answer, and later I'll develop them into an actual answer:

pp.c's PP(pp_study) has these curious lines (minus a comment):

if (len == 0 || len > I32_MAX || !SvPOK(sv) || SvUTF8(sv) || SvVALID(sv)) {
    RETPUSHNO;
}

It looks like scalars with the UTF8 flag set aren't studied at all.
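
A small sketch of that observation, assuming a perl whose pp_study contains a check like the one above. It takes one ASCII string, makes a copy that differs only in having the UTF8 flag set (via `utf8::upgrade`), and compares study's return value for the two. The variable names are illustrative.

```perl
use strict;
use warnings;

my $plain   = "gattaca";
my $flagged = $plain;
utf8::upgrade($flagged);    # same characters, but stored with the UTF8 flag on

# study returns false when the check above makes it skip the scalar.
my %studied = (
    plain   => (study $plain)   ? 1 : 0,
    flagged => (study $flagged) ? 1 : 0,
);

for my $name (qw(plain flagged)) {
    my $sv = $name eq 'plain' ? $plain : $flagged;
    printf "%-8s utf8=%d studied=%d\n",
        $name, (utf8::is_utf8($sv) ? 1 : 0), $studied{$name};
}
```

The two strings compare equal with `eq`, so this is one way a change that is invisible at the Perl level can decide whether study does anything.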

brian d foy answered Nov 15 '22 03:11


Not really. If you search for uses of study and most of the results are in Perl's own test suite, that suggests nobody uses it. Also, because of a bug, you could only notice speed benefits on global variables. It did bring some speedups when dealing with English text (sometimes even 2 times faster), but you had to make the variable global.

It also sometimes caused infinite loops or false positives (study could add bugs to your program, even though it was only supposed to make it faster), and because of that it was removed (or rather, made a no-op) in Perl 5.16; nobody wanted to maintain a feature nobody cared about anyway.

Konrad Borowski answered Nov 15 '22 01:11


None. Since 2012, study does nothing.

Currently the code has

if (len == 0 || len > I32_MAX || !SvPOK(sv) || SvUTF8(sv) || SvVALID(sv)) {
    /* Historically, study was skipped in these cases. */
    SETs(&PL_sv_no);
    return NORMAL;
}

/* Make study a no-op. It's no longer useful and its existence
   complicates matters elsewhere. */
SETs(&PL_sv_yes);
return NORMAL;

which means that study returns true in the case where it would formerly have done something, and false otherwise -- but it never actually does anything.
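
A quick way to see those return values, as a sketch with illustrative variable names. It also checks `$]` against 5.016, since perl5160delta is where study is documented as becoming a no-op.

```perl
use strict;
use warnings;

my $text  = "a non-empty byte string";
my $empty = "";

# Normalize study's return value to 1/0 for printing.
my $ret_text  = study($text)  ? 1 : 0;
my $ret_empty = study($empty) ? 1 : 0;

# $] is the numeric version of the running interpreter.
my $is_noop = $] >= 5.016 ? 1 : 0;

print "study(non-empty) => $ret_text\n";
print "study(empty)     => $ret_empty\n";
print "study is a no-op on this perl: ", ($is_noop ? "yes" : "no"), "\n";
```

The empty string trips the `len == 0` branch, so it reports false even though nothing is being studied in either case.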

hobbs answered Nov 15 '22 03:11