I am parsing tab-separated values:
pub fn parse_tsv(line: &str) -> MyType {
    for (i, value) in line.split('\t').enumerate() {
        // ...
    }
    // ...
}
perf top shows str.find at the top. When I look at the generated assembly, much of the work is related to UTF-8 handling of the characters in &str.
And it is very slow: it takes 99% of the execution time.
But to find \t, I can't simply search for the one-byte \t in a UTF-8 string.
What am I doing wrong? What is Rust stdlib doing wrong?
Or maybe there is a string library in Rust that represents strings simply as u8 bytes, but still offers split(), find(), and the other methods?
As long as your string is ASCII, or you only need to match ASCII bytes (as in your case, where you search for tabs), you can treat it as bytes with the as_bytes() method and operate on u8 values (bytes) instead of chars (Unicode scalar values). This is safe for UTF-8 input because bytes below 0x80 never occur inside a multi-byte sequence, so a tab byte is always a real tab. It should also be much faster. &[u8] is a slice, so you still get split() (which takes a closure rather than a char). Byte slices have no find(), but iter().position() does the same job.
let line = String::new();
let bytes = line.as_bytes();
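pub fn parse_tsv(line: &[u8]) {
    // Split on the raw tab byte; no UTF-8 decoding is involved.
    for (i, value) in line.split(|c| *c == b'\t').enumerate() {
        // ... `value` is a `&[u8]` field, `i` is its column index
    }
}

fn main() {
    let line = String::new();
    let bytes = line.as_bytes();
    // `bytes` is already a `&[u8]`, so pass it directly.
    parse_tsv(bytes);
}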
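If the fields are needed as &str again after splitting on bytes, something like the following might work. This is only a sketch: split_fields and find_byte are hypothetical helper names that do not appear in the code above, and the conversion back uses std::str::from_utf8, which re-validates each field and so gives back some of the speed gained.

```rust
use std::str::Utf8Error;

/// Splits one TSV line on tab bytes and returns each field as a `&str`.
/// `split_fields` is a hypothetical helper name used for illustration.
pub fn split_fields(line: &str) -> Result<Vec<&str>, Utf8Error> {
    line.as_bytes()
        // Byte-level split: compares each byte against the tab byte (0x09).
        .split(|&b| b == b'\t')
        // A tab byte never occurs inside a multi-byte UTF-8 sequence, so each
        // piece is valid UTF-8; `from_utf8` still checks this at runtime.
        .map(std::str::from_utf8)
        .collect()
}

/// Byte-level stand-in for `str::find` when searching for a single byte.
/// `find_byte` is also a hypothetical helper name.
pub fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}
```

Calling split_fields("a\tb") would give the fields "a" and "b". A trailing tab produces an empty last field, because slice split() yields an empty slice after a final separator.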