I was hoping to use Rust to speed up some of the text processing scripts that are currently written in Python.
To compare the performance of the two languages, I decided to test them on a very simple task:
- Read a file from STDIN, line by line.
- If the line starts with >, save the line to a headers.txt file.
- Otherwise, save the line to a sequences.txt file.
For this test, I am using a fasta file with 10 million lines, which looks as follows:
$ head uniparc_active-head.fasta
>UPI0000000001 status=active
MGAAASIQTTVNTLSERISSKLEQEANASAQTKCDIEIGNFYIRQNHGCNLTVKNMCSAD
ADAQLDAVLSAATETYSGLTPEQKAYVPAMFTAALNIQTSVNTVVRDFENYVKQTCNSSA
VVDNKLKIQNVIIDECYGAPGSPTNLEFINTGSSKGNCAIKALMQLTTKATTQIAPKQVA
GTGVQFYMIVIGVIILAALFMYYAKRMLFTSTNDKIKLILANKENVHWTTYMDTFFRTSP
MVIATTDMQN
>UPI0000000002 status=active
MMTPENDEEQTSVFSATVYGDKIQGKNKRKRVIGLCIRISMVISLLSMITMSAFLIVRLN
QCMSANEAAITDAAVAVAAASSTHRKVASSTTQYDHKESCNGLYYQGSCYILHSDYQLFS
DAKANCTAESSTLPNKSDVLITWLIDYVEDTWGSDGNPITKTTSDYQDSDVSQEVRKYFC
Here is my Python script:
import fileinput
with open('headers.txt', 'w') as hof, \
     open('sequences.txt', 'w') as sof:
    for line in fileinput.input():
        if line[0] == '>':
            hof.write(line)
        else:
            sof.write(line)
and my Rust script (which I compile with cargo build --release):
use std::io;
use std::fs::File;
use std::io::Write;
use std::io::BufRead;
fn main() {
    let stdin = io::stdin();
    let mut headers = File::create("headers.txt").unwrap();
    let mut sequences = File::create("sequences.txt").unwrap();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        match &line[..1] {
            ">" => writeln!(headers, "{}", line).unwrap(),
            _ => writeln!(sequences, "{}", line).unwrap(),
        }
    }
}
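A side note on the match above, unrelated to speed: slicing with &line[..1] panics if a line is empty, or if its first character is more than one byte long. A minimal sketch of a check that avoids the slice, assuming the only thing that matters is whether the line begins with '>' (is_header is a hypothetical helper name, not something from the script):

```rust
/// Returns true for FASTA header lines, i.e. lines that begin with '>'.
/// `is_header` is a hypothetical helper name used only for illustration.
fn is_header(line: &str) -> bool {
    // starts_with inspects the first character without slicing the string,
    // so an empty line yields false instead of panicking.
    line.starts_with('>')
}
```

With this, the match could become a plain if is_header(&line) { ... } else { ... }.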
Running some benchmarks:
Python 2.7
$ time bash -c 'cat uniparc_active-head.fasta | python2 src/main.py'
real 0m11.704s
user 0m6.996s
sys 0m1.100s
Python 3.5
$ time bash -c 'cat uniparc_active-head.fasta | python3 src/main.py'
real 0m16.788s
user 0m12.508s
sys 0m1.576s
PyPy 5.3.1
$ time bash -c 'cat uniparc_active-head.fasta | pypy src/main.py'
real 0m6.526s
user 0m1.536s
sys 0m0.884s
Rust 1.14.0
$ cargo build --release
$ time bash -c 'cat uniparc_active-head.fasta | target/release/parse_text'
real 0m17.493s
user 0m2.728s
sys 0m15.408s
So Rust is ~3x slower than PyPy, and even slower than Python 3.
Can anyone shed some light on this? Did I make a mistake in the Rust code? If not, should I stick to Python / PyPy for processing text files, or is there another language that would be better for the job?
As suggested by @BurntSushi5, replacing
let mut headers = File::create("headers.txt").unwrap();
let mut sequences = File::create("sequences.txt").unwrap();
with
let mut headers = io::BufWriter::new(File::create("headers.txt").unwrap());
let mut sequences = io::BufWriter::new(File::create("sequences.txt").unwrap());
brought the speed up to what I expected:
$ time bash -c 'cat uniparc_active-head.fasta | target/release/parse_text'
real 0m5.645s
user 0m1.396s
sys 0m0.804s
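My understanding of why this helps: a File in Rust is unbuffered, so every writeln! on it turns into at least one write system call, which would explain the 15 seconds of sys time in the first Rust run. Python file objects are buffered by default, so the Python script never paid that cost. Below is a sketch of the whole program with buffered writers, not a definitive version. The names split_fasta and run are hypothetical, and I made the function generic over the reader and writers only so that it can be exercised without touching real files.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufWriter, Write};

/// Copies lines starting with '>' to `headers` and all other lines to `sequences`.
/// `split_fasta` is a hypothetical name; the generic parameters are an assumption
/// made so the same code works on files, stdin, or in-memory buffers.
fn split_fasta<R: BufRead, H: Write, S: Write>(
    input: R,
    headers: H,
    sequences: S,
) -> io::Result<()> {
    // BufWriter collects many small writes in memory and hands them to the
    // underlying writer in larger chunks.
    let mut headers = BufWriter::new(headers);
    let mut sequences = BufWriter::new(sequences);

    for line in input.lines() {
        let line = line?;
        if line.starts_with('>') {
            writeln!(headers, "{}", line)?;
        } else {
            writeln!(sequences, "{}", line)?;
        }
    }

    // Flush explicitly so that a failed final write is reported as an error
    // rather than being ignored when the BufWriter is dropped.
    headers.flush()?;
    sequences.flush()?;
    Ok(())
}

/// Hypothetical entry point mirroring the original script: reads STDIN and
/// writes headers.txt and sequences.txt in the current directory.
fn run() -> io::Result<()> {
    let stdin = io::stdin();
    split_fasta(
        stdin.lock(),
        File::create("headers.txt")?,
        File::create("sequences.txt")?,
    )
}
```

Calling run().unwrap() from main should give the same behaviour as the fixed script above. I have not benchmarked this exact variant.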