Now that the Read::chars iterator has been officially deprecated, what is the proper way to obtain an iterator over the chars coming from a reader like stdin without reading the entire stream into memory?
The corresponding issue for deprecation nicely sums up the problems with Read::chars and offers suggestions:
Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.

Of course the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the encoding_rs or encoding crate.
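The incremental approach described in the quote can be sketched roughly as below. This is only an illustration under some assumptions: `for_each_utf8_chunk` is a hypothetical helper name, the callback-based API is just one possible shape, and the chunk size is a parameter you would tune. It relies on str::from_utf8, Utf8Error::valid_up_to and Utf8Error::error_len, and carries an incomplete trailing sequence over to the next read.

```rust
use std::io::{self, Read};
use std::str;

/// Hypothetical helper: reads `reader` in chunks of `chunk_size` bytes and
/// passes each run of valid UTF-8 to `on_str`. Bytes of a character that is
/// split across two reads are kept at the front of the buffer for the next read.
fn for_each_utf8_chunk<R: Read>(
    mut reader: R,
    chunk_size: usize,
    mut on_str: impl FnMut(&str),
) -> io::Result<()> {
    assert!(chunk_size >= 4, "a chunk must be able to hold one full char");
    let mut buf = vec![0u8; chunk_size];
    let mut filled = 0; // buf[..filled] holds bytes that are not decoded yet

    loop {
        let n = match reader.read(&mut buf[filled..]) {
            Ok(n) => n,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            // End of stream: leftover bytes mean a truncated character.
            if filled == 0 {
                return Ok(());
            }
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "stream ended in the middle of a character",
            ));
        }
        filled += n;

        // Find the longest valid prefix. `error_len() == None` means the
        // tail may simply be an incomplete character, which is not an error yet.
        let (valid, bad) = match str::from_utf8(&buf[..filled]) {
            Ok(s) => (s.len(), None),
            Err(e) => (e.valid_up_to(), e.error_len().map(|_| e)),
        };
        let s = str::from_utf8(&buf[..valid]).expect("prefix was validated above");
        if !s.is_empty() {
            on_str(s);
        }
        if let Some(e) = bad {
            return Err(io::Error::new(io::ErrorKind::InvalidData, e));
        }

        // Move the incomplete tail (at most 3 bytes) to the front.
        buf.copy_within(valid..filled, 0);
        filled -= valid;
    }
}

fn main() -> io::Result<()> {
    // A 6-byte chunk forces the 4-byte emoji to straddle chunk boundaries.
    let data = "😻🧐🐪💩".as_bytes();
    for_each_utf8_chunk(data, 6, |s| {
        for c in s.chars() {
            println!(">{}<", c);
        }
    })
}
```

With a realistic chunk size (several kilobytes), this performs few I/O calls while keeping memory use bounded.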
The most efficient solution in terms of number of I/O calls is to read everything into a giant buffer String and iterate over that:
use std::io::{self, Read};

fn main() {
    let stdin = io::stdin();
    let mut s = String::new();
    stdin.lock().read_to_string(&mut s).expect("Couldn't read");
    for c in s.chars() {
        println!(">{}<", c);
    }
}
You can combine this with an answer from Is there an owned version of String::chars?:
use std::io::{self, Read};

fn reader_chars<R: Read>(mut rdr: R) -> io::Result<impl Iterator<Item = char>> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    Ok(s.into_chars()) // from https://stackoverflow.com/q/47193584/155423
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    for c in reader_chars(stdin.lock())? {
        println!(">{}<", c);
    }
    Ok(())
}
We now have a function that returns an iterator of chars for any type that implements Read.
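The into_chars call comes from the linked answer, so the snippet does not compile on its own. As a minimal self-contained sketch, the hypothetical `owned_chars` function below stands in for it by collecting the chars into a Vec. This is a simplification rather than the linked answer's implementation, and it costs four bytes per char while the iterator is alive.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for `into_chars`: an iterator that owns its data.
/// Collecting into a `Vec<char>` is simple but uses 4 bytes per character.
fn owned_chars(s: String) -> impl Iterator<Item = char> {
    s.chars().collect::<Vec<char>>().into_iter()
}

fn reader_chars<R: Read>(mut rdr: R) -> io::Result<impl Iterator<Item = char>> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    Ok(owned_chars(s))
}

fn main() -> io::Result<()> {
    // A byte slice implements `Read`, so it can stand in for stdin here.
    for c in reader_chars("héllo".as_bytes())? {
        println!(">{}<", c);
    }
    Ok(())
}
```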
Once you have this pattern, it's just a matter of deciding where to make the tradeoff of memory allocation vs I/O requests. Here's a similar idea that uses line-sized buffers:
use std::io::{BufRead, BufReader, Read};

fn reader_chars<R: Read>(rdr: R) -> impl Iterator<Item = char> {
    // We use 6 bytes here to force emoji to be segmented for demo purposes
    // Pick more appropriate size for your case
    let reader = BufReader::with_capacity(6, rdr);

    reader
        .lines()
        .flat_map(|l| l) // Ignoring any errors
        .flat_map(|s| s.into_chars()) // from https://stackoverflow.com/q/47193584/155423
}

fn main() {
    // emoji are 4 bytes each
    let data = "😻🧐🐪💩";
    let data = data.as_bytes();

    for c in reader_chars(data) {
        println!(">{}<", c);
    }
}
The far extreme would be to perform one I/O request for every character. This wouldn't take much memory, but would have a lot of I/O overhead.
Copy and paste the implementation of Read::chars into your own code. It will work as well as it used to.
As a couple of others have mentioned, it is possible to copy the deprecated implementation of Read::chars for use in your own code. Whether this is ideal depends on your use case. For me, it proved good enough for now, although my application will likely outgrow this approach in the near future.
To illustrate how this can be done, let's look at a concrete example:
use std::io::{self, Error, ErrorKind, Read};
use std::result;
use std::str;

struct MyReader<R> {
    inner: R,
}

impl<R: Read> MyReader<R> {
    fn new(inner: R) -> MyReader<R> {
        MyReader {
            inner,
        }
    }
}

#[derive(Debug)]
enum MyReaderError {
    NotUtf8,
    Other(Error),
}

impl<R: Read> Iterator for MyReader<R> {
    type Item = result::Result<char, MyReaderError>;

    fn next(&mut self) -> Option<result::Result<char, MyReaderError>> {
        // `?` ends the iteration (returns None) at end of stream
        let first_byte = match read_one_byte(&mut self.inner)? {
            Ok(b) => b,
            Err(e) => return Some(Err(MyReaderError::Other(e))),
        };
        let width = utf8_char_width(first_byte);
        if width == 1 {
            return Some(Ok(first_byte as char));
        }
        if width == 0 {
            return Some(Err(MyReaderError::NotUtf8));
        }
        // Read the remaining bytes of a multi-byte character
        let mut buf = [first_byte, 0, 0, 0];
        {
            let mut start = 1;
            while start < width {
                match self.inner.read(&mut buf[start..width]) {
                    Ok(0) => return Some(Err(MyReaderError::NotUtf8)),
                    Ok(n) => start += n,
                    Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
                    Err(e) => return Some(Err(MyReaderError::Other(e))),
                }
            }
        }
        Some(match str::from_utf8(&buf[..width]).ok() {
            Some(s) => Ok(s.chars().next().unwrap()),
            None => Err(MyReaderError::NotUtf8),
        })
    }
}
The above code also requires read_one_byte and utf8_char_width to be implemented. Those should look something like:
// Width in bytes of a UTF-8 sequence, indexed by its first byte (0 = invalid)
static UTF8_CHAR_WIDTH: [u8; 256] = [
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x1F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x3F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x5F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x7F
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0x9F
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0xBF
    0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // 0xDF
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, // 0xEF
    4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0, // 0xFF
];

fn utf8_char_width(b: u8) -> usize {
    UTF8_CHAR_WIDTH[b as usize] as usize
}

// Returns None at end of stream; retries reads that were interrupted
fn read_one_byte(reader: &mut dyn Read) -> Option<io::Result<u8>> {
    let mut buf = [0];
    loop {
        return match reader.read(&mut buf) {
            Ok(0) => None,
            Ok(..) => Some(Ok(buf[0])),
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => Some(Err(e)),
        };
    }
}
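If the 256-entry table feels unwieldy, the same mapping can be written as a match on byte ranges. This is a sketch, not part of the original Read::chars implementation, and `utf8_width_from_first_byte` is a hypothetical name.

```rust
/// Hypothetical table-free variant: the width in bytes of the UTF-8
/// sequence that starts with `b`, or 0 if `b` cannot start a sequence.
fn utf8_width_from_first_byte(b: u8) -> usize {
    match b {
        0x00..=0x7F => 1, // ASCII
        0xC2..=0xDF => 2, // 0xC0 and 0xC1 would only encode overlong forms
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4, // 0xF5 and above would exceed U+10FFFF
        _ => 0,           // continuation bytes and invalid lead bytes
    }
}

fn main() {
    for s in ["a", "é", "€", "😻"] {
        let first = s.as_bytes()[0];
        println!(
            "{:?}: first byte 0x{:02X}, width {}",
            s,
            first,
            utf8_width_from_first_byte(first)
        );
    }
}
```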
Now we can use the MyReader implementation to produce an iterator of chars over some reader, like io::Stdin:
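fn main() {
    let stdin = io::stdin();
    let reader = MyReader::new(stdin.lock());
    for c in reader {
        // Each item is a Result, which does not implement Display
        match c {
            Ok(c) => println!("{}", c),
            Err(e) => {
                eprintln!("error: {:?}", e);
                break;
            }
        }
    }
}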
The limitations of this approach are discussed at length in the original issue thread. One particular concern worth pointing out, however, is that this iterator will not handle non-UTF-8 encoded streams correctly.