 

How to speed up UTF-8 string processing

Tags:

string

find

rust

I am parsing tab-separated values:

pub fn parse_tsv(line: &str) -> MyType {
    for (i, value) in line.split('\t').enumerate() {
        // ...
    }
    // ...
}

perf top shows str::find near the top. When I look at the generated assembly, there is a lot of work related to UTF-8 decoding of the characters in the &str.

And it is very slow — it takes 99% of the execution time.

But to find \t I can't simply search for one-byte \t in a UTF-8 string.

What am I doing wrong? What is Rust stdlib doing wrong?

Or maybe there is some string library for Rust that can represent strings simply as u8 bytes, but with all the split(), find(), and other methods?

vladon asked Jan 12 '17

1 Answer

As long as your string is ASCII, or you don't need to match on multi-byte UTF-8 sequences (as in your case, where you search for a single-byte tab), you can treat it as raw bytes with the as_bytes() method and operate on u8 bytes instead of chars (Unicode scalar values). This avoids UTF-8 decoding entirely and should be much faster. The byte slice &[u8] offers similar methods to &str, with slightly different signatures: split() takes a predicate over bytes, and the counterpart of str::find is iter().position().
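As a side-by-side sketch of that correspondence (the string literal here is just an illustrative example), the &str pattern-based methods and their byte-slice counterparts look like this:

```rust
fn main() {
    let line = "a\tb\tc";
    let bytes = line.as_bytes();

    // &str::find returns a byte offset; on &[u8] the
    // equivalent is iter().position() over the bytes.
    assert_eq!(line.find('\t'), Some(1));
    assert_eq!(bytes.iter().position(|&b| b == b'\t'), Some(1));

    // &str::split takes a char/pattern; <[u8]>::split
    // takes a predicate over individual bytes.
    let str_fields: Vec<&str> = line.split('\t').collect();
    let byte_fields: Vec<&[u8]> = bytes.split(|&b| b == b'\t').collect();
    assert_eq!(str_fields.len(), byte_fields.len());
}
```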

pub fn parse_tsv(line: &[u8]) {
    for (i, value) in line.split(|&b| b == b'\t').enumerate() {
        // `value` is a &[u8] field; `i` is its index
    }
}

fn main() {
    let line = String::new();
    let bytes = line.as_bytes();

    parse_tsv(bytes)
}
ljedrz answered Oct 10 '22