My question has been partially answered, so I've revised it in response to things I've learned from comments and additional experiments.
In summary, I want a fast I/O routine for programming contests, in which problems are solved with a single file and no external crates. It should read in a sequence of whitespace-separated tokens from a BufRead
(either stdin or a file). The tokens may be integers, floats or ASCII words, separated by spaces and newlines, so it seems I should support FromStr
types generically. A small minority of problems are interactive, meaning not all of the input is available initially, but it always comes in complete lines.
For context, here's the discussion that led me to post here. Someone wrote very fast custom code to parse integers directly from the &[u8]
output of BufRead::fill_buf()
, but it's not generic in FromStr
.
Here is my best solution so far (emphasis on the Scanner
struct):
use std::io::{self, prelude::*};
fn solve<B: BufRead, W: Write>(mut scan: Scanner<B>, mut w: W) {
let n = scan.token();
let mut a = Vec::with_capacity(n);
let mut b = Vec::with_capacity(n);
for _ in 0..n {
a.push(scan.token::<i64>());
b.push(scan.token::<i64>());
}
let mut order: Vec<_> = (0..n).collect();
order.sort_by_key(|&i| b[i] - a[i]);
let ans: i64 = order
.into_iter()
.enumerate()
.map(|(i, x)| a[x] * i as i64 + b[x] * (n - 1 - i) as i64)
.sum();
writeln!(w, "{}", ans);
}
fn main() {
let stdin = io::stdin();
let stdout = io::stdout();
let reader = Scanner::new(stdin.lock());
let writer = io::BufWriter::new(stdout.lock());
solve(reader, writer);
}
pub struct Scanner<B> {
reader: B,
buf_str: String,
buf_iter: std::str::SplitWhitespace<'static>,
}
impl<B: BufRead> Scanner<B> {
pub fn new(reader: B) -> Self {
Self {
reader,
buf_str: String::new(),
buf_iter: "".split_whitespace(),
}
}
pub fn token<T: std::str::FromStr>(&mut self) -> T {
loop {
if let Some(token) = self.buf_iter.next() {
return token.parse().ok().expect("Failed parse");
}
self.buf_str.clear();
self.reader
.read_line(&mut self.buf_str)
.expect("Failed read");
self.buf_iter = unsafe { std::mem::transmute(self.buf_str.split_whitespace()) };
}
}
}
By avoiding unnecessary allocations, this Scanner
is quite fast. If we didn't care about unsafety, it can be made even faster by, instead of doing read_line()
into a String
, doing read_until(b'\n')
into a Vec<u8>
, followed by str::from_utf8_unchecked()
.
However, I'd also like to know what's the fastest safe solution. Is there a clever way to tell Rust that what my Scanner
implementation does is actually safe, eliminating the mem::transmute
? Intuitively, it seems we should think of the SplitWhitespace
object as owning the buffer until it's effectively dropped after it returns None
.
All else being equal, I'd like a "nice" idiomatic standard library solution, as I'm trying to present Rust to others who do programming contests.
I'm so glad you asked, as I solved this exact problem in my LibCodeJam rust implementation. Specifically, reading raw tokens from a BufRead
is handled by the TokensReader
type as well as some tiny related helpers.
Here's the relevant excerpt. The basic idea here is to scan the BufRead::fill_buf
buffer for whitespace, and copying non-whitespace characters into a local buffer, which is reused between token calls. Once a whitespace character is found, or the stream ends, the local buffer is interpreted as UTF-8 and returned as an &str
.
#[derive(Debug)]
pub enum LoadError {
Io(io::Error),
Utf8Error(Utf8Error),
OutOfTokens,
}
/// TokenBuffer is a resuable buffer into which tokens are
/// read into, one-by-one. It is cleared but not deallocated
/// between each token.
#[derive(Debug)]
struct TokenBuffer(Vec<u8>);
impl TokenBuffer {
/// Clear the buffer and start reading a new token
fn lock(&mut self) -> TokenBufferLock {
self.0.clear();
TokenBufferLock(&mut self.0)
}
}
/// TokenBufferLock is a helper type that helps manage the lifecycle
/// of reading a new token, then interpreting it as UTF-8.
#[derive(Debug, Default)]
struct TokenBufferLock<'a>(&'a mut Vec<u8>);
impl<'a> TokenBufferLock<'a> {
/// Add some bytes to a token
fn extend(&mut self, chunk: &[u8]) {
self.0.extend(chunk)
}
/// Complete the token and attempt to interpret it as UTF-8
fn complete(self) -> Result<&'a str, LoadError> {
from_utf8(self.0).map_err(LoadError::Utf8Error)
}
}
pub struct TokensReader<R: io::BufRead> {
reader: R,
token: TokenBuffer,
}
impl<R: io::BufRead> Tokens for TokensReader<R> {
fn next_raw(&mut self) -> Result<&str, LoadError> {
use std::io::ErrorKind::Interrupted;
// Clear leading whitespace
loop {
match self.reader.fill_buf() {
Err(ref err) if err.kind() == Interrupted => continue,
Err(err) => return Err(LoadError::Io(err)),
Ok([]) => return Err(LoadError::OutOfTokens),
// Got some content; scan for the next non-whitespace character
Ok(buf) => match buf.iter().position(|byte| !byte.is_ascii_whitespace()) {
Some(i) => {
self.reader.consume(i);
break;
}
None => self.reader.consume(buf.len()),
},
};
}
// If we reach this point, there is definitely a non-empty token ready to be read.
let mut token_buf = self.token.lock();
loop {
match self.reader.fill_buf() {
Err(ref err) if err.kind() == Interrupted => continue,
Err(err) => return Err(LoadError::Io(err)),
Ok([]) => return token_buf.complete(),
// Got some content; scan for the next whitespace character
Ok(buf) => match buf.iter().position(u8::is_ascii_whitespace) {
Some(i) => {
token_buf.extend(&buf[..i]);
self.reader.consume(i + 1);
return token_buf.complete();
}
None => {
token_buf.extend(buf);
self.reader.consume(buf.len());
}
},
}
}
}
}
This implementation doesn't handle parsing strings into FromStr
types— that's handled separately— but it does handle correctly accumulating bytes, separating them into whitespace-separated tokens, and interpreting those tokens as UTF-8. It does assume that only ASCII whitespace will be used to separate Tokens.
It's worth noting that FromStr
cannot be used directly on the fill_buf
buffer, because there's no guarantee that a token doesn't straddle the boundary between two fill_buf
calls, and there's no way to force a BufRead
to read more bytes until the existing buffer is fully consumed. I'm assuming it's pretty obvious that once you have an Ok(&str)
, you can perform FromStr
on it at your leisure.
This implementation is not 0-copy, but is is (amortized) 0-allocation, and it minimizes unnecessary copying or buffering. It uses a single persistent buffer that is only resized if it's too small for a single token, and it reuses this buffer between tokens. Bytes are copied into this buffer directly from the input BufRead
buffer, without extra intermediary copying.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With