While CStr is typically used for FFI, I am reading from a &[u8] which is NUL-terminated and is ensured to be valid UTF-8, so no checks are needed.
However, the NUL terminator isn't necessarily at the end of the slice. What's a good way to get this as a &str?
It was suggested to use CStr::from_bytes_with_nul, but this returns an error (and so panics if you unwrap it) on an interior \0 character, that is, when the \0 isn't the last byte of the slice.
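For reference, here is a minimal sketch of that behaviour next to CStr::from_bytes_until_nul, which has been in std since Rust 1.69 and stops at the first NUL instead of rejecting it. The helper names str_until_nul and with_nul_rejects_interior are made up for illustration, and falling back to the whole slice when there is no NUL is an assumption about what you want in that case.

```rust
use std::ffi::CStr;
use std::str::Utf8Error;

/// Hypothetical helper: the text before the first NUL, or the whole slice
/// if it contains no NUL at all. UTF-8 is validated in both cases.
fn str_until_nul(bytes: &[u8]) -> Result<&str, Utf8Error> {
    match CStr::from_bytes_until_nul(bytes) {
        // A NUL was found: everything before it becomes the &str.
        Ok(c_str) => c_str.to_str(),
        // No NUL anywhere (assumed policy): treat the whole slice as the string.
        Err(_) => std::str::from_utf8(bytes),
    }
}

/// Hypothetical helper: true when `from_bytes_with_nul` refuses the input.
/// It returns `Err` for an interior or missing NUL; it only panics if the
/// caller unwraps that result.
fn with_nul_rejects_interior(bytes: &[u8]) -> bool {
    CStr::from_bytes_with_nul(bytes).is_err()
}
```

With these, str_until_nul(b"abc\0def") gives Ok("abc"), while with_nul_rejects_interior(b"abc\0def") is true.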
I would use iterator adaptors to find the index of the first zero byte:
// Safety: the bytes before the first `\0` (or the whole slice, if there is none) must be valid UTF-8.
pub unsafe fn str_from_u8_nul_utf8_unchecked(utf8_src: &[u8]) -> &str {
    let nul_range_end = utf8_src.iter()
        .position(|&c| c == b'\0')
        .unwrap_or(utf8_src.len()); // default to length if no `\0` present
    ::std::str::from_utf8_unchecked(&utf8_src[0..nul_range_end])
}
The major advantage is that it forces you to handle every case explicitly, including a slice with no zero byte in it at all.
If you want the version that checks for well-formed UTF-8:
pub fn str_from_u8_nul_utf8(utf8_src: &[u8]) -> Result<&str, std::str::Utf8Error> {
    let nul_range_end = utf8_src.iter()
        .position(|&c| c == b'\0')
        .unwrap_or(utf8_src.len()); // default to length if no `\0` present
    ::std::str::from_utf8(&utf8_src[0..nul_range_end])
}
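To make the edge cases concrete, here is a small sketch that repeats the checked version so it compiles on its own and then asserts what it returns for a few inputs. The check_edge_cases function and the sample inputs are mine, not part of the answer above.

```rust
use std::str::{self, Utf8Error};

// Same logic as the checked version above, repeated so this sketch is self-contained.
fn str_from_u8_nul_utf8(utf8_src: &[u8]) -> Result<&str, Utf8Error> {
    let nul_range_end = utf8_src
        .iter()
        .position(|&c| c == b'\0')
        .unwrap_or(utf8_src.len()); // default to length if no `\0` present
    str::from_utf8(&utf8_src[..nul_range_end])
}

// Illustrative inputs only; each assertion documents one edge case.
fn check_edge_cases() {
    // NUL in the middle: everything after it is dropped.
    assert_eq!(str_from_u8_nul_utf8(b"abc\0def"), Ok("abc"));
    // No NUL at all: the whole slice is used.
    assert_eq!(str_from_u8_nul_utf8(b"abc"), Ok("abc"));
    // NUL first, or an empty slice: empty string.
    assert_eq!(str_from_u8_nul_utf8(b"\0abc"), Ok(""));
    assert_eq!(str_from_u8_nul_utf8(b""), Ok(""));
    // Bytes after the NUL are never validated, so garbage there is fine.
    assert_eq!(str_from_u8_nul_utf8(b"ok\0\xff"), Ok("ok"));
    // Invalid UTF-8 before the NUL is reported as an error.
    assert!(str_from_u8_nul_utf8(b"\xff\0").is_err());
}
```

Note that only the bytes before the first NUL are validated, which is usually what you want for fixed-size buffers with trailing junk.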
Three other possible ways of doing this, using mostly functions from std:
use std::ffi::CStr;
use std::str;

// safe: checks for a null byte first; panics if the text is not valid utf-8
fn str_from_null_terminated_utf8_safe(s: &[u8]) -> &str {
    if s.iter().any(|&x| x == 0) {
        unsafe { str_from_null_terminated_utf8(s) }
    } else {
        str::from_utf8(s).unwrap()
    }
}

// unsafe: s must contain a null byte
unsafe fn str_from_null_terminated_utf8(s: &[u8]) -> &str {
    CStr::from_ptr(s.as_ptr() as *const _).to_str().unwrap()
}

// unsafe: s must contain a null byte, and be valid utf-8
unsafe fn str_from_null_terminated_utf8_unchecked(s: &[u8]) -> &str {
    str::from_utf8_unchecked(CStr::from_ptr(s.as_ptr() as *const _).to_bytes())
}
As a slight aside: benchmark results for all the options in this thread:
With s = b"\0"
test dtwood::bench_str_from_null_terminated_utf8 ... bench: 9 ns/iter (+/- 0)
test dtwood::bench_str_from_null_terminated_utf8_safe ... bench: 10 ns/iter (+/- 3)
test dtwood::bench_str_from_null_terminated_utf8_unchecked ... bench: 5 ns/iter (+/- 1)
test ideasman42::bench_str_from_u8_nul_utf8_unchecked ... bench: 1 ns/iter (+/- 0)
test ker::bench_str_from_u8_nul_utf8 ... bench: 4 ns/iter (+/- 0)
test ker::bench_str_from_u8_nul_utf8_unchecked ... bench: 1 ns/iter (+/- 0)
With s = b"abcdefghij\0klmnop"
test dtwood::bench_str_from_null_terminated_utf8 ... bench: 15 ns/iter (+/- 2)
test dtwood::bench_str_from_null_terminated_utf8_safe ... bench: 20 ns/iter (+/- 2)
test dtwood::bench_str_from_null_terminated_utf8_unchecked ... bench: 6 ns/iter (+/- 0)
test ideasman42::bench_str_from_u8_nul_utf8_unchecked ... bench: 7 ns/iter (+/- 0)
test ker::bench_str_from_u8_nul_utf8 ... bench: 15 ns/iter (+/- 2)
test ker::bench_str_from_u8_nul_utf8_unchecked ... bench: 5 ns/iter (+/- 0)
With s = b"abcdefghij" * 512 + "\0klmnopqrs"
test dtwood::bench_str_from_null_terminated_utf8 ... bench: 351 ns/iter (+/- 35)
test dtwood::bench_str_from_null_terminated_utf8_safe ... bench: 1,987 ns/iter (+/- 274)
test dtwood::bench_str_from_null_terminated_utf8_unchecked ... bench: 170 ns/iter (+/- 18)
test ideasman42::bench_str_from_u8_nul_utf8_unchecked ... bench: 2,466 ns/iter (+/- 292)
test ker::bench_str_from_u8_nul_utf8 ... bench: 1,971 ns/iter (+/- 209)
test ker::bench_str_from_u8_nul_utf8_unchecked ... bench: 1,828 ns/iter (+/- 205)
So if you're super concerned about performance, it's probably best to benchmark with your particular data set: dtwood::str_from_null_terminated_utf8_unchecked seems to perform better with longer strings, but ker::str_from_u8_nul_utf8_unchecked does better on small (< 20 character) strings.
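If you want to time this against your own data without nightly's #[bench], something like the following rough sketch works on stable. It is not the harness that produced the numbers above; the names long_input, until_nul, mean_time and bench_long_input are all made up here, and a simple Instant loop like this is much noisier than a real benchmarking framework.

```rust
use std::hint::black_box;
use std::str::Utf8Error;
use std::time::{Duration, Instant};

/// Builds the long input from the third run above:
/// "abcdefghij" repeated 512 times, followed by "\0klmnopqrs".
fn long_input() -> Vec<u8> {
    let mut v = b"abcdefghij".repeat(512);
    v.extend_from_slice(b"\0klmnopqrs");
    v
}

/// Candidate under test (checked variant): text before the first NUL.
fn until_nul(src: &[u8]) -> Result<&str, Utf8Error> {
    let end = src.iter().position(|&c| c == 0).unwrap_or(src.len());
    std::str::from_utf8(&src[..end])
}

/// Mean wall-clock time per call of `f` over `iters` calls.
/// `black_box` keeps the optimizer from discarding the work.
fn mean_time<F: FnMut() -> usize>(iters: u32, mut f: F) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    start.elapsed() / iters
}

/// Times the candidate on the long input; swap in your own data and functions.
fn bench_long_input(iters: u32) -> Duration {
    let input = long_input();
    mean_time(iters, || {
        until_nul(black_box(input.as_slice())).map_or(0, |s| s.len())
    })
}
```

Call bench_long_input with a large iteration count (for example 100_000) in a release build, and compare the candidates on inputs that look like yours rather than on these synthetic ones.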