Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Better way to parse and search a string?

Tags:

python

rust

I have been looking to speed up a basic Python function which basically just takes a line of text and checks the line for a substring. The Python program is as follows:

import time

def fun(line):
    l = line.split(" ", 10)
    if 'TTAGGG' in l[9]:
        pass  # Do nothing

line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31"

time0 = time.time()
for i in range(10000):
    fun(line)
print time.time() - time0

I wanted to see if I could use some of the high level features of Rust to possibly gain some performance, but the code runs considerably slower. The Rust conversion is:

extern crate regex;
extern crate time;
use regex::Regex;

fn main() {
    let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31";    
    let substring: &str = "TTAGGG";
    let time0: f64 = time::precise_time_s();

    for _ in 0..10000 {
        fun(line, substring);
    }

    let time1: f64 = time::precise_time_s();
    let elapsed: f64 = time1 - time0;
    println!("{}", elapsed);
}


fn fun(line: &str, substring: &str) {
    let l: Vec<&str> = line.split(" ")
                .enumerate()
                .filter(|&(i, _)| i==9)
                .map(|(_, e) | e)
                .collect();

    let re = Regex::new(substring).unwrap();    
    if re.is_match(&l[0]) {
        // Do nothing
    }
}

On my machine, Python times this at 0.0065s vs Rusts 1.3946s.

Just checking some basic timings, the line.split() part of the code takes around 1s, and the regex step is around 0.4s. Can this really be right, or is there an issue with timing this properly?

like image 827
kezzos Avatar asked Dec 20 '22 01:12

kezzos


1 Answers

As a baseline, I ran your Python program with Python 2.7.6. Over 10 runs, it had a mean time of 12.2ms with a standard deviation of 443μs. I don't know how you got the very good time of 6.5ms.

Running your Rust code with Rust 1.4.0-dev (febdc3b20), without optimizations, I got a mean of 958ms and a standard deviation of 33ms.

Running your code with optimizations (cargo run --release), I got a mean of 34.6ms and standard deviation of 495μs. Always do benchmarking in release mode.

There are further optimizations you can do:

Compiling the regex once, outside of the timing loop:

fn main() {
    // ...
    let substring = "TTAGGG";
    let re = Regex::new(substring).unwrap();

    // ...

    for _ in 0..10000 {
        fun(line, &re);
    }

    // ...
}

fn fun(line: &str, re: &Regex) {
    // ...
}

Produces an average of 10.4ms with a standard deviation of 678μs.

Switching to a substring match:

fn fun(line: &str, substring: &str) {
    // ...

    if l[0].contains(substring) {
        // Do nothing
    }
}

Has a mean of 8.7ms and a standard deviation of 334μs.

And finally, if you look at just the one result instead of collecting everything into a vector:

fn fun(line: &str, substring: &str) {
    let col = line.split(" ").nth(9);

    if col.map(|c| c.contains(substring)).unwrap_or(false) {
        // Do nothing
    }
}

Has a mean of 6.30ms and standard deviation of 114μs.

like image 136
Shepmaster Avatar answered Dec 26 '22 11:12

Shepmaster