I am parsing tab-separated values:
pub fn parse_tsv(line: &str) -> MyType {
    for (i, value) in line.split('\t').enumerate() {
        // ...
    }
    // ...
}
perf top shows str.find at the top. When I look at the generated assembly code, a lot of the work is related to the UTF-8 encoding of the characters in the &str.
And it is relatively very slow: it takes 99% of the execution time.
But to find \t I can't simply search for one-byte \t in a UTF-8 string.
What am I doing wrong? What is Rust stdlib doing wrong?
Or maybe there is some string library for Rust that represents strings simply as u8 bytes, but still has split(), find(), and the other methods?
As long as your string is ASCII, or you only need to match on an ASCII byte (as in your case, where you search for tabs), you can treat it as bytes with the as_bytes() method and then operate on u8 values (bytes) instead of chars (Unicode scalar values). This is safe because in UTF-8 a byte below 0x80, such as \t, never appears inside a multi-byte sequence. It should be much faster. A &[u8] slice still has split(), which takes a closure as the predicate. It has no find(), but iter().position() does the same job for a single byte.
let line = String::new();
let bytes = line.as_bytes();
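pub fn parse_tsv(line: &[u8]) {
    // Split on the tab byte; each value is a &[u8] sub-slice of line.
    for (i, value) in line.split(|c| *c == b'\t').enumerate() {
        // ...
    }
}

fn main() {
    let line = String::new();
    let bytes = line.as_bytes();
    // bytes is already a &[u8], so no extra borrow is needed.
    parse_tsv(bytes);
}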
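If the fields are needed as &str again after splitting, one possible approach is to convert each byte sub-slice back with std::str::from_utf8. The sketch below is only an illustration: split_tsv_fields and find_byte are hypothetical helper names, not part of the standard library, and the sample line is made up. Note that from_utf8 validates each field, so it adds some cost per field.

```rust
use std::str;

/// Hypothetical helper: splits one TSV line on tab bytes and
/// returns the fields as &str slices borrowed from the input.
fn split_tsv_fields(line: &str) -> Vec<&str> {
    line.as_bytes()
        .split(|&b| b == b'\t')
        // Cutting valid UTF-8 at an ASCII byte leaves valid UTF-8 on both
        // sides, so this conversion is not expected to fail here.
        .map(|field| str::from_utf8(field).expect("field should be valid UTF-8"))
        .collect()
}

/// Hypothetical helper: byte-level counterpart of str::find for one byte.
/// Returns the index of the first matching byte, if any.
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

fn main() {
    // Made-up sample line containing a non-ASCII character.
    let line = "id\tnäme\t42";
    let fields = split_tsv_fields(line);
    println!("{:?}", fields);
    println!("{:?}", find_byte(line.as_bytes(), b'\t'));
}
```

Whether this beats line.split('\t') in practice depends on the compiler version and the data, so it is worth measuring both variants on real input.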