Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating strings and identifying substrings is very slow

I'd like to benchmark certain operations in Rust, but I seem to be having some trouble:

fn main(){

    let needle   = (0..100).map(|_| "b").collect::<String>();
    let haystack = (0..100_000).map(|_| "a").collect::<String>();

    println!("Data ready.");

    for _ in 0..1_000_000 {
        if haystack.contains( &needle ) {
            // Stuff...
        }
    }

}

The above takes a very long time to complete while the same operation in Ruby finishes in around 4.5 seconds:

needle   = 'b' * 100
haystack = 'a' * 100_000

puts 'Data ready.'

1_000_000.times do
    haystack.include? needle
end

I can't help but think that I'm doing something fundamentally wrong. What would be the proper way to do this in Rust?

rustc 1.0.0 (a59de37e9 2015-05-13) (built 2015-05-14)
ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
like image 947
Tasos Laskos Avatar asked May 16 '15 06:05

Tasos Laskos


1 Answers

A fix for this issue was merged today. That means it should be part of the next nightly, and will be expected to be released in Rust 1.3. The fix revived the Two-way substring search implementation that Rust used to have and adapted it to the new Pattern API in the standard library.

The Two-way algorithm is a good match for Rust's libcore since it is a linear time substring search algorithm that uses O(1) space and needs no dynamic allocation.

The particular implementation contains a simple addition that will reject this particular query in the question extremely quickly (and no, it was not written because of this question, it was part of the old code too).

During setup the searcher computes a kind of fingerprint for the needle: For each byte in the needle, take its low 6 bits, which is a number 0-63, then set the corresponding bit in the u64 variable byteset.

let byteset = needle.iter().fold(0, |a, &b| (1 << ((b & 0x3f) as usize)) | a);

Since the needle only contains 'b's, the value of byteset will have only the 34th bit set (98 & 63 == 34).

Now we can test any byte whether it can possibly be part of the needle or not. If its corresponding bit isn't set in byteset, the needle cannot match. Each byte we test in the haystack in this case will be 'a' (97 & 63 == 33), and it can't match. So the algorithm will read a single byte, reject it, and then skip the needle's length.

fn byteset_contains(&self, byte: u8) -> bool {
    (self.byteset >> ((byte & 0x3f) as usize)) & 1 != 0
}

// Quickly skip by large portions unrelated to our substring
if !self.byteset_contains(haystack[self.position + needle.len() - 1]) {
    self.position += needle.len();
    continue 'search;
}

From libcore/str/pattern.rs in rust-lang/rust

like image 194
bluss Avatar answered Oct 21 '22 11:10

bluss