
How can I read a non-UTF8 file line by line in Rust

Tags: utf-8, rust

I'm trying to read a file one line at a time in Rust, and started by following the advice in this question:

use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let reader = BufReader::new(file);
    for line in reader.lines() {
        match line {
            Ok(line) => println!("Ok: {}", line),
            Err(error) => println!("Err: {}", error),
        }
    }
    return Ok(());
}

However, I have non-UTF8 files. The Python chardet.universaldetector library tells me this is ISO-8859-1:

Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire

Out of the box, Rust is unable to interpret the lines with non-UTF8 characters:

$ ./target/release/main1 
Ok: Cuba
Err: stream did not contain valid UTF-8
Ok: Cyprus
Ok: Czech Republic
Err: stream did not contain valid UTF-8

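The two failing lines are exactly the ones containing non-ASCII characters. In ISO-8859-1, ç is the single byte 0xE7, whereas UTF-8 treats 0xE7 as the lead byte of a three-byte sequence, so an ASCII letter right after it makes the line invalid. A minimal sketch of that check, using hand-written bytes that are only assumed to match the file's contents (based on the chardet report); `is_valid_utf8` is a helper name made up for illustration:

```rust
/// Returns true if `bytes` is well-formed UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // "Curaçao" as ISO-8859-1 bytes (assumed contents): ç is the single byte 0xE7.
    let latin1 = [b'C', b'u', b'r', b'a', 0xE7, b'a', b'o'];
    // The same word in UTF-8: ç is the two-byte sequence 0xC3 0xA7.
    let utf8 = "Curaçao".as_bytes();

    println!("latin1 valid UTF-8: {}", is_valid_utf8(&latin1)); // false
    println!("utf8 valid UTF-8: {}", is_valid_utf8(utf8)); // true
}
```

This is the same validation that lines() applies to each line before handing back a String, which is why only the pure-ASCII lines come through as Ok.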
So I tried the encoding_rs_io library. I'm using Windows 1252 here instead of ISO-8859-1, but it seems to work with this data:

use std::error::Error;
use std::fs::File;
use std::io::Read;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let mut reader = DecodeReaderBytesBuilder::new().encoding(Some(WINDOWS_1252)).build(file);
    let mut buffer = vec![];
    reader.read_to_end(&mut buffer)?;
    println!("{}", String::from_utf8(buffer).unwrap());
    return Ok(());
}

This successfully decodes the non-ASCII characters:

$ ./target/release/main2 
Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire

However, it does not have a lines() method, so cannot read one line at a time. I note that the ripgrep project uses this library to decode non-UTF8 files, and I've stepped into its source code in a debugger. As far as I can tell, it's doing its own hand-rolled CR/LF detection.

So, surely the task of reading non-UTF8 files one line at a time in Rust must have been solved already. Do I really need to reinvent the wheel on this? Help gratefully appreciated!

Huw Walters asked Dec 31 '22 20:12


1 Answer

DecodeReaderBytes implements io::Read, so you should be able to wrap it in a std::io::BufReader and use its lines method:

use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    // The decoder transcodes Windows-1252 bytes to UTF-8 on the fly;
    // BufReader then supplies lines() on top of the decoded stream.
    let reader = BufReader::new(
        DecodeReaderBytesBuilder::new()
            .encoding(Some(WINDOWS_1252))
            .build(file));
    for line in reader.lines() {
        // Each item is an io::Result<String>, so propagate errors with `?`.
        println!("{}", line?);
    }
    Ok(())
}
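
If the data really is ISO-8859-1 and pulling in a crate is not an option, a standard-library-only alternative is possible, because every ISO-8859-1 byte maps to the Unicode code point with the same value. The sketch below is an illustration under that assumption, not part of encoding_rs_io; `decode_latin1` and `latin1_lines` are made-up helper names, and the in-memory sample stands in for the file. Note that Windows-1252 differs from ISO-8859-1 for bytes 0x80 to 0x9F, so this is not a drop-in replacement for the decoder above.

```rust
use std::io::{self, BufRead, Cursor};

/// Decodes ISO-8859-1 bytes: each byte is the Unicode code point of the same value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Reads `reader` one line at a time, splitting on b'\n' and decoding each line.
fn latin1_lines<R: BufRead>(reader: R) -> impl Iterator<Item = io::Result<String>> {
    reader.split(b'\n').map(|line| {
        line.map(|mut bytes| {
            // Drop a trailing CR so CRLF files behave like `lines()`.
            if bytes.last() == Some(&b'\r') {
                bytes.pop();
            }
            decode_latin1(&bytes)
        })
    })
}

fn main() -> io::Result<()> {
    // In-memory stand-in for the file (hypothetical sample data).
    let data: &[u8] = b"Cuba\nCura\xE7ao\r\nC\xF4te d'Ivoire\n";
    for line in latin1_lines(Cursor::new(data)) {
        println!("{}", line?);
    }
    Ok(())
}
```

To read the real file, pass `BufReader::new(File::open("countries.txt")?)` to `latin1_lines` instead of the `Cursor`.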

Jmb answered Jan 05 '23 18:01