How to read a non-UTF8 encoded csv file?

With the csv crate and Rust 1.31.0, I want to read CSV files with ANSI (Windows-1252) encoding as easily as UTF-8 ones.

Things I have tried (with no luck) after reading the whole file into a Vec<u8>:

  • CString
  • OsString

In my company, we have a lot of ANSI-encoded CSV files.

Also, if possible, I would prefer not to load the entire file into a Vec<u8>, but to read it line by line (CRLF line endings), as many of the files are big (50 MB or more…).

In the file Cargo.toml, I have this dependency:

[dependencies]
csv = "1"

test.csv consists of the following content, saved with Windows-1252 encoding:

Café;au;lait
Café;au;lait

The code in main.rs file:

extern crate csv;

use std::error::Error;
use std::fs::File;
use std::io::BufReader;
use std::path::Path;
use std::process;

fn example() -> Result<(), Box<Error>> {
    let file_name = r"test.csv";
    let file_handle = File::open(Path::new(file_name))?;
    let reader = BufReader::new(file_handle);

    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(reader);

    // println!("ANSI");
    // for result in rdr.byte_records() {
    //    let record = result?;
    //    println!("{:?}", record);
    // }

    println!("UTF-8");
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

fn main() {
    if let Err(err) = example() {
        println!("error running example: {}", err);
        process::exit(1);
    }
}

The output is:

UTF-8
error running example: CSV parse error: record 0 (line 1, field: 0, byte: 0): invalid utf-8: invalid UTF-8 in field 0 near byte index 3
error: process didn't exit successfully: `target\debug\test-csv.exe` (exit code: 1)

When using rdr.byte_records() (uncommenting the relevant part of code), the output is:

ANSI
ByteRecord(["Caf\\xe9", "au", "lait"])
asked Dec 02 '22 by Linker Storm

1 Answer

I suspect this question is underspecified. In particular, it's not clear why your use of the ByteRecord API is insufficient. In the csv crate, byte records exist precisely for cases like this, where your CSV data isn't strictly UTF-8 but is in an alternative, ASCII-compatible encoding such as Windows-1252. (An ASCII-compatible encoding is one in which ASCII is a subset. Windows-1252 and UTF-8 are both ASCII compatible; UTF-16 is not.) Your code sample above shows that you're using byte records, but it doesn't explain why that is insufficient.

With that said, if your goal is to get your data into Rust's string data type (String/&str), then your only option is to transcode the contents of your CSV data from Windows-1252 to UTF-8. This is necessary because Rust's string data type uses UTF-8 for its in-memory representation. You cannot have a Rust String/&str that is Windows-1252 encoded because Windows-1252 is not a subset of UTF-8.
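To see why, consider what happens if you try to interpret the raw Windows-1252 bytes as UTF-8 directly. Here is a minimal std-only sketch, where the byte string stands in for a field taken from one of your byte records:

```rust
fn main() {
    // A raw field as it would appear in a ByteRecord read from the
    // Windows-1252 encoded file: 0xE9 is 'é' in Windows-1252.
    let field: &[u8] = b"Caf\xE9";

    // Strict conversion fails: a lone 0xE9 byte is not valid UTF-8.
    assert!(std::str::from_utf8(field).is_err());

    // Lossy conversion "succeeds" but replaces the byte with U+FFFD,
    // so the original character is lost rather than decoded.
    let lossy = String::from_utf8_lossy(field);
    assert_eq!(lossy, "Caf\u{FFFD}");
    println!("{}", lossy);
}
```

In other words, without transcoding you can either keep the raw bytes (byte records) or discard the non-UTF-8 data (lossy conversion), but you cannot recover the intended characters.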

Other comments have recommended the use of the encoding crate. However, I would instead recommend the use of encoding_rs, if your use case aligns with the same use cases solved by the Encoding Standard, which is specifically geared towards the web. Fortunately, I believe such an alignment exists.

In order to satisfy your criteria for reading CSV data in a streaming fashion without first loading the entire contents into memory, you need to use a wrapper around the encoding_rs crate that implements streaming decoding for you. The encoding_rs_io crate provides this for you. (It's used inside of ripgrep to do fast streaming decoding before searching UTF-8.)

Here is an example program that puts all of the above together, using Rust 2018:

use std::fs::File;
use std::process;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() {
    if let Err(err) = try_main() {
        eprintln!("{}", err);
        process::exit(1);
    }
}

fn try_main() -> csv::Result<()> {
    let file = File::open("test.csv")?;
    let transcoded = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(file);
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(transcoded);
    for result in rdr.records() {
        let r = result?;
        println!("{:?}", r);
    }
    Ok(())
}

with the Cargo.toml:

[package]
name = "so53826986"
version = "0.1.0"
edition = "2018"

[dependencies]
csv = "1"
encoding_rs = "0.8.13"
encoding_rs_io = "0.1.3"

And the output:

$ cargo run --release
   Compiling so53826986 v0.1.0 (/tmp/so53826986)
    Finished release [optimized] target(s) in 0.63s
     Running `target/release/so53826986`
StringRecord(["Café", "au", "lait"])

In particular, if you swap out rdr.records() for rdr.byte_records(), then we can see more clearly what happened:

$ cargo run --release
   Compiling so53826986 v0.1.0 (/tmp/so53826986)
    Finished release [optimized] target(s) in 0.61s
     Running `target/release/so53826986`
ByteRecord(["Caf\\xc3\\xa9", "au", "lait"])

Namely, your input contained Caf\xE9, but the byte record now contains Caf\xC3\xA9. This is a result of translating the Windows-1252 codepoint value of 233 (encoded as its literal byte, \xE9) to U+00E9 LATIN SMALL LETTER E WITH ACUTE, which is UTF-8 encoded as \xC3\xA9.
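For bytes in the 0xA0-0xFF range, Windows-1252 agrees with Latin-1, where each byte value equals its Unicode codepoint, so the translation of this particular field can be reproduced with a small std-only sketch. (This illustrates the mapping only; it is not a general-purpose Windows-1252 decoder, since the 0x80-0x9F range differs from Latin-1.)

```rust
fn main() {
    let win1252: &[u8] = b"Caf\xE9";
    // For ASCII and the 0xA0-0xFF range, the Windows-1252 byte value is
    // the Unicode codepoint, so `u8 as char` performs the decoding here.
    let decoded: String = win1252.iter().map(|&b| b as char).collect();
    assert_eq!(decoded, "Café");
    // Rust re-encodes the String as UTF-8: \xE9 becomes \xC3\xA9.
    assert_eq!(decoded.as_bytes(), b"Caf\xC3\xA9");
    println!("{}", decoded);
}
```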

answered Dec 04 '22 by BurntSushi5