Consider this program:
use std::io::BufRead;
use std::io;
fn main() {
    let mut n = 0;
    let stdin = io::stdin();
    for _ in stdin.lock().lines() {
        n += 1;
    }
    println!("{}", n);
}
Why is it over 10x as slow as the GNU version of wc? Take a look at how I measure it:
$ yes | dd count=1000000 | wc -l
256000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.16586 s, 439 MB/s
$ yes | dd count=1000000 | ./target/release/wc
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 41.685 s, 12.3 MB/s
256000000
There are many reasons why your code is way slower than the original wc. There are a few things you pay for which you actually don't need at all. By removing those, you can already get a quite substantial speed boost.
BufRead::lines() returns an iterator which yields String elements. Due to this design, it will (it has to!) allocate memory for every single line. The lines() method is convenient and makes the code easy to write, but it isn't meant for high-performance situations.
To avoid allocating heap memory for every single line, you can use BufRead::read_line() instead. The code is a bit more verbose, but as you can see, we are reusing the heap memory of s:
let mut n = 0;
let mut s = String::new();
let stdin = io::stdin();
let mut lock = stdin.lock();
loop {
    s.clear();
    let res = lock.read_line(&mut s);
    if res.is_err() || res.unwrap() == 0 {
        break;
    }
    n += 1;
}
println!("{}", n);
On my notebook, this results in:
$ yes | dd count=1000000 | wc -l
256000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0,981827 s, 521 MB/s
$ yes | dd count=1000000 | ./wc
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 6,87622 s, 74,5 MB/s
256000000
As you can see, it improved things a lot, but is still not equivalent.
Since we are reading into a String, we are validating the raw input from stdin to be proper UTF-8. This costs time! But we are only interested in the raw bytes, since we only need to count the newline characters (0xA). We can get rid of the UTF-8 checks by using a Vec<u8> and BufRead::read_until():
let mut n = 0;
let mut v = Vec::new();
let stdin = io::stdin();
let mut lock = stdin.lock();
loop {
    v.clear();
    let res = lock.read_until(0xA, &mut v);
    if res.is_err() || res.unwrap() == 0 {
        break;
    }
    n += 1;
}
println!("{}", n);
This results in:
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 4,24162 s, 121 MB/s
256000000
That's a 60% improvement. But the original wc is still faster by a factor of 3.5x!
Now we've used up all the low-hanging fruit to boost performance. To match wc's speed, one would have to do some serious profiling, I think. In our current solution, perf reports the following:

- memchr; I don't think this can be improved
- <StdinLock as std::io::BufRead>::fill_buf() and <StdinLock as std::io::BufRead>::consume()
- A huge part of the remaining time is spent in main directly (due to inlining).

From the looks of it, we are also paying a bit for the cross-platform abstraction. There is some time spent in Mutex methods and stuff.
But at this point, I'm just guessing, because I don't have the time to look into this further. Sorry :<
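That said, one of the points above suggests an experiment: the fill_buf()/consume() bookkeeping currently runs once per line. Below is a minimal sketch (my own variation, not part of the original answer) that counts newlines directly in the reader's internal buffer, so fill_buf() and consume() run once per buffer instead:

use std::io::{self, BufRead};

fn main() {
    let mut n: u64 = 0;
    let stdin = io::stdin();
    let mut lock = stdin.lock();
    loop {
        // Borrow the BufRead's internal buffer instead of copying each line out of it.
        let len = {
            let buf = lock.fill_buf().expect("read error");
            if buf.is_empty() {
                break;
            }
            n += buf.iter().filter(|&&b| b == b'\n').count() as u64;
            buf.len()
        };
        // Mark the whole buffer as consumed in one step.
        lock.consume(len);
    }
    println!("{}", n);
}

I haven't benchmarked this variant against the read_until() version, so treat it as a starting point for profiling rather than a measured result.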
But please note that wc is an old tool and is highly optimized for the platform it's running on and for the task it's performing. I guess that knowledge about Linux internals would help performance a lot. This is really specialized, so I wouldn't expect to match its performance that easily.
This is because your version is in no way equivalent to the GNU one, which doesn't allocate any memory for strings, but only moves the file pointer and increments different counters. In addition, it processes raw bytes while Rust's String must be valid UTF-8.

GNU wc source
Here's a version I got from Arnavion on #rust-beginners IRC:
use std::io::Read;

fn main() {
    let mut buffer = [0u8; 1024];
    let stdin = ::std::io::stdin();
    let mut stdin = stdin.lock();
    let mut wc = 0usize;
    loop {
        match stdin.read(&mut buffer) {
            Ok(0) => {
                break;
            },
            Ok(len) => {
                wc += buffer[0..len].into_iter().filter(|&&b| b == b'\n').count();
            },
            Err(err) => {
                panic!("{}", err);
            },
        }
    }
    println!("{}", wc);
}
This gets performance very close to what the original wc does.
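If one wanted to push further, a natural next step would be a vectorized newline search instead of the byte-by-byte filter. Here's a minimal sketch assuming the memchr crate is added as an extra dependency (memchr = "2" in Cargo.toml); it is not used in any of the answers above:

use std::io::Read;

use memchr::memchr_iter;

fn main() {
    // A larger buffer means fewer read() calls per megabyte of input.
    let mut buffer = [0u8; 64 * 1024];
    let stdin = std::io::stdin();
    let mut stdin = stdin.lock();
    let mut wc = 0usize;
    loop {
        match stdin.read(&mut buffer) {
            Ok(0) => break,
            // memchr_iter scans for 0xA using an optimized byte search.
            Ok(len) => wc += memchr_iter(b'\n', &buffer[..len]).count(),
            Err(err) => panic!("{}", err),
        }
    }
    println!("{}", wc);
}

Whether this actually closes the remaining gap to GNU wc depends on the platform and buffer size; I'd measure it with the same yes | dd pipeline as above.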