Fastest idiomatic I/O routine in Rust for programming contests?

My question has been partially answered, so I've revised it in response to things I've learned from comments and additional experiments.

In summary, I want a fast I/O routine for programming contests, in which problems are solved with a single file and no external crates. It should read in a sequence of whitespace-separated tokens from a BufRead (either stdin or a file). The tokens may be integers, floats or ASCII words, separated by spaces and newlines, so it seems I should support FromStr types generically. A small minority of problems are interactive, meaning not all of the input is available initially, but it always comes in complete lines.

For context, here's the discussion that led me to post here. Someone wrote very fast custom code to parse integers directly from the &[u8] output of BufRead::fill_buf(), but it's not generic in FromStr.

Here is my best solution so far (emphasis on the Scanner struct):

use std::io::{self, prelude::*};

fn solve<B: BufRead, W: Write>(mut scan: Scanner<B>, mut w: W) {
    let n = scan.token();
    let mut a = Vec::with_capacity(n);
    let mut b = Vec::with_capacity(n);
    for _ in 0..n {
        a.push(scan.token::<i64>());
        b.push(scan.token::<i64>());
    }
    let mut order: Vec<_> = (0..n).collect();
    order.sort_by_key(|&i| b[i] - a[i]);
    let ans: i64 = order
        .into_iter()
        .enumerate()
        .map(|(i, x)| a[x] * i as i64 + b[x] * (n - 1 - i) as i64)
        .sum();
    writeln!(w, "{}", ans).expect("Failed write");
}

fn main() {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let reader = Scanner::new(stdin.lock());
    let writer = io::BufWriter::new(stdout.lock());
    solve(reader, writer);
}

pub struct Scanner<B> {
    reader: B,
    buf_str: String,
    buf_iter: std::str::SplitWhitespace<'static>,
}
impl<B: BufRead> Scanner<B> {
    pub fn new(reader: B) -> Self {
        Self {
            reader,
            buf_str: String::new(),
            buf_iter: "".split_whitespace(),
        }
    }
    pub fn token<T: std::str::FromStr>(&mut self) -> T {
        loop {
            if let Some(token) = self.buf_iter.next() {
                return token.parse().ok().expect("Failed parse");
            }
            self.buf_str.clear();
            self.reader
                .read_line(&mut self.buf_str)
                .expect("Failed read");
            self.buf_iter = unsafe { std::mem::transmute(self.buf_str.split_whitespace()) };
        }
    }
}

By avoiding unnecessary allocations, this Scanner is quite fast. If we didn't care about unsafety, it could be made faster still: instead of doing read_line() into a String, do read_until(b'\n') into a Vec<u8>, followed by str::from_utf8_unchecked().
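For concreteness, that unsafe variant could look something like this (a sketch: the UnsafeScanner name and details are illustrative, not from the original post; it is only sound if the input is valid UTF-8, e.g. plain ASCII contest input):

```rust
use std::io::prelude::*;

/// Like Scanner, but reads raw bytes and skips UTF-8 validation.
pub struct UnsafeScanner<B> {
    reader: B,
    buf: Vec<u8>,
    buf_iter: std::str::SplitAsciiWhitespace<'static>,
}

impl<B: BufRead> UnsafeScanner<B> {
    pub fn new(reader: B) -> Self {
        Self {
            reader,
            buf: Vec::new(),
            buf_iter: "".split_ascii_whitespace(),
        }
    }

    pub fn token<T: std::str::FromStr>(&mut self) -> T {
        loop {
            if let Some(token) = self.buf_iter.next() {
                return token.parse().ok().expect("Failed parse");
            }
            self.buf.clear();
            self.reader
                .read_until(b'\n', &mut self.buf)
                .expect("Failed read");
            self.buf_iter = unsafe {
                // Skip UTF-8 validation, then erase the lifetime as before.
                let line = std::str::from_utf8_unchecked(&self.buf);
                std::mem::transmute(line.split_ascii_whitespace())
            };
        }
    }
}

fn main() {
    // Any BufRead works; a byte slice stands in for locked stdin here.
    let mut scan = UnsafeScanner::new(&b"3 -4 hello"[..]);
    let x: i64 = scan.token();
    let y: i64 = scan.token();
    let w: String = scan.token();
    assert_eq!((x, y, w.as_str()), (3, -4, "hello"));
}
```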

However, I'd also like to know what's the fastest safe solution. Is there a clever way to tell Rust that what my Scanner implementation does is actually safe, eliminating the mem::transmute? Intuitively, it seems we should think of the SplitWhitespace object as owning the buffer until it's effectively dropped after it returns None.

All else being equal, I'd like a "nice" idiomatic standard library solution, as I'm trying to present Rust to others who do programming contests.

EbTech asked Jun 04 '19 17:06

1 Answer

I'm so glad you asked, as I solved this exact problem in my LibCodeJam rust implementation. Specifically, reading raw tokens from a BufRead is handled by the TokensReader type as well as some tiny related helpers.

Here's the relevant excerpt. The basic idea is to scan the BufRead::fill_buf buffer for whitespace, copying non-whitespace bytes into a local buffer that is reused between token calls. Once a whitespace character is found, or the stream ends, the local buffer is interpreted as UTF-8 and returned as an &str.

use std::io;
use std::str::{from_utf8, Utf8Error};

#[derive(Debug)]
pub enum LoadError {
    Io(io::Error),
    Utf8Error(Utf8Error),
    OutOfTokens,
}

/// TokenBuffer is a reusable buffer into which tokens are
/// read, one-by-one. It is cleared but not deallocated
/// between tokens.
#[derive(Debug)]
struct TokenBuffer(Vec<u8>);

impl TokenBuffer {
    /// Clear the buffer and start reading a new token
    fn lock(&mut self) -> TokenBufferLock {
        self.0.clear();
        TokenBufferLock(&mut self.0)
    }
}

/// TokenBufferLock is a helper type that helps manage the lifecycle
/// of reading a new token, then interpreting it as UTF-8.
#[derive(Debug)]
struct TokenBufferLock<'a>(&'a mut Vec<u8>);

impl<'a> TokenBufferLock<'a> {
    /// Add some bytes to a token
    fn extend(&mut self, chunk: &[u8]) {
        self.0.extend(chunk)
    }

    /// Complete the token and attempt to interpret it as UTF-8
    fn complete(self) -> Result<&'a str, LoadError> {
        from_utf8(self.0).map_err(LoadError::Utf8Error)
    }
}

/// Minimal excerpt of the Tokens trait, so the impl below compiles;
/// the full version lives in LibCodeJam.
pub trait Tokens {
    fn next_raw(&mut self) -> Result<&str, LoadError>;
}

pub struct TokensReader<R: io::BufRead> {
    reader: R,
    token: TokenBuffer,
}

impl<R: io::BufRead> Tokens for TokensReader<R> {
    fn next_raw(&mut self) -> Result<&str, LoadError> {
        use std::io::ErrorKind::Interrupted;

        // Clear leading whitespace
        loop {
            match self.reader.fill_buf() {
                Err(ref err) if err.kind() == Interrupted => continue,
                Err(err) => return Err(LoadError::Io(err)),
                Ok([]) => return Err(LoadError::OutOfTokens),
                // Got some content; scan for the next non-whitespace character
                Ok(buf) => match buf.iter().position(|byte| !byte.is_ascii_whitespace()) {
                    Some(i) => {
                        self.reader.consume(i);
                        break;
                    }
                    None => self.reader.consume(buf.len()),
                },
            };
        }

        // If we reach this point, there is definitely a non-empty token ready to be read.
        let mut token_buf = self.token.lock();

        loop {
            match self.reader.fill_buf() {
                Err(ref err) if err.kind() == Interrupted => continue,
                Err(err) => return Err(LoadError::Io(err)),
                Ok([]) => return token_buf.complete(),
                // Got some content; scan for the next whitespace character
                Ok(buf) => match buf.iter().position(u8::is_ascii_whitespace) {
                    Some(i) => {
                        token_buf.extend(&buf[..i]);
                        self.reader.consume(i + 1);
                        return token_buf.complete();
                    }
                    None => {
                        token_buf.extend(buf);
                        self.reader.consume(buf.len());
                    }
                },
            }
        }
    }
}

This implementation doesn't handle parsing strings into FromStr types (that's handled separately), but it does correctly handle accumulating bytes, splitting them into whitespace-separated tokens, and interpreting those tokens as UTF-8. It assumes that only ASCII whitespace separates tokens.

It's worth noting that FromStr cannot be used directly on the fill_buf buffer, because there's no guarantee that a token doesn't straddle the boundary between two fill_buf calls, and there's no way to force a BufRead to read more bytes until the existing buffer is fully consumed. I'm assuming it's pretty obvious that once you have an Ok(&str), you can perform FromStr on it at your leisure.
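For example (a standalone sketch; parse_token is a hypothetical helper, not part of LibCodeJam), once the raw token is available as a &str, FromStr handles the rest:

```rust
use std::str::FromStr;

// Hypothetical helper: parse a raw token into any FromStr type.
fn parse_token<T: FromStr>(raw: &str) -> Result<T, T::Err> {
    raw.parse()
}

fn main() {
    assert_eq!(parse_token::<i64>("42"), Ok(42));
    let x: f64 = parse_token("3.5").expect("failed to parse");
    assert_eq!(x, 3.5);
}
```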

This implementation is not zero-copy, but it is (amortized) zero-allocation, and it minimizes unnecessary copying and buffering. It uses a single persistent buffer that is only resized if it's too small for a single token, and it reuses this buffer between tokens. Bytes are copied into this buffer directly from the input BufRead buffer, with no intermediate copies.
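That reuse behaviour can be checked with a plain Vec (a standalone sketch, not LibCodeJam code): clear() drops the contents but keeps the allocation, so shorter subsequent tokens never reallocate.

```rust
fn main() {
    let mut buf: Vec<u8> = Vec::new();

    // First token grows the buffer once.
    buf.extend_from_slice(b"a_fairly_long_token");
    let cap = buf.capacity();

    // Reuse for the next token: length resets, capacity is kept.
    buf.clear();
    buf.extend_from_slice(b"short");

    // No reallocation happened for the shorter token.
    assert_eq!(buf.capacity(), cap);
    assert_eq!(buf.as_slice(), b"short");
}
```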

Lucretiel answered Nov 03 '22 11:11