I'm trying to read a file one line at a time in Rust, and started by following the advice in this question:
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let reader = BufReader::new(file);
    for line in reader.lines() {
        match line {
            Ok(line) => println!("Ok: {}", line),
            Err(error) => println!("Err: {}", error),
        }
    }
    return Ok(());
}
However, I have non-UTF8 files. The Python chardet.universaldetector library tells me this is ISO-8859-1:
Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire
Out of the box, Rust is unable to interpret the lines with non-UTF8 characters:
$ ./target/release/main1
Ok: Cuba
Err: stream did not contain valid UTF-8
Ok: Cyprus
Ok: Czech Republic
Err: stream did not contain valid UTF-8
So I tried the encoding_rs_io library. I'm using Windows 1252 here instead of ISO-8859-1, but it seems to work with this data:
use std::error::Error;
use std::fs::File;
use std::io::Read;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let mut reader = DecodeReaderBytesBuilder::new().encoding(Some(WINDOWS_1252)).build(file);
    let mut buffer = vec![];
    reader.read_to_end(&mut buffer)?;
    println!("{}", String::from_utf8(buffer).unwrap());
    return Ok(());
}
This successfully reads the non-UTF8 characters:
$ ./target/release/main2
Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire
However, it does not have a lines() method, so I cannot read one line at a time. I note that the ripgrep project uses this library to decode non-UTF8 files, and I've stepped into its source code in a debugger. As far as I can tell, it's doing its own hand-rolled CR/LF detection.
So, surely the task of reading non-UTF8 files one line at a time in Rust must have been solved already. Do I really need to reinvent the wheel on this? Help gratefully appreciated!
DecodeReaderBytes implements io::Read, so you should be able to wrap it in a std::io::BufReader and use its lines method:
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    // Transcode Windows-1252 bytes to UTF-8 first, so that lines() only ever sees valid UTF-8.
    let reader = BufReader::new(
        DecodeReaderBytesBuilder::new()
            .encoding(Some(WINDOWS_1252))
            .build(file),
    );
    for line in reader.lines() {
        // Each item is an io::Result<String>; `?` propagates any read error.
        println!("{}", line?);
    }
    Ok(())
}
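If the file really is strict ISO-8859-1, a dependency-free alternative is possible, because every ISO-8859-1 byte maps to the Unicode code point with the same value. The following is only a minimal sketch under that assumption, not a replacement for encoding_rs: the helper names latin1_to_string and read_latin1_lines are made up for illustration, and the input is an in-memory byte string rather than countries.txt. Note that it does not handle Windows-1252, which assigns different characters to bytes 0x80 to 0x9F.

```rust
use std::io::{self, BufRead, Cursor};

/// Decode ISO-8859-1 bytes: each byte becomes the Unicode code point
/// with the same numeric value (U+0000 to U+00FF).
/// Hypothetical helper, named for this example only.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Read every line from `reader` as raw bytes, strip the trailing
/// LF or CR LF, and decode the rest as ISO-8859-1.
/// Hypothetical helper, named for this example only.
fn read_latin1_lines<R: BufRead>(mut reader: R) -> io::Result<Vec<String>> {
    let mut lines = Vec::new();
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // read_until works on bytes, so it never fails on invalid UTF-8.
        // It returns Ok(0) once the input is exhausted.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        if buf.last() == Some(&b'\n') {
            buf.pop();
            if buf.last() == Some(&b'\r') {
                buf.pop();
            }
        }
        lines.push(latin1_to_string(&buf));
    }
    Ok(lines)
}

fn main() -> io::Result<()> {
    // Sample ISO-8859-1 input: 0xE7 is 'ç' and 0xF4 is 'ô'.
    let data: &[u8] = b"Cuba\nCura\xE7ao\r\nC\xF4te d'Ivoire\n";
    for line in read_latin1_lines(Cursor::new(data))? {
        println!("{}", line);
    }
    Ok(())
}
```

For a real file, the same function should accept BufReader::new(File::open("countries.txt")?) in place of the Cursor. For any encoding other than strict ISO-8859-1, the BufReader around DecodeReaderBytes shown above is the safer choice.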