Using the csv crate with Rust 1.31.0, I would like to read CSV files in ANSI (Windows-1252) encoding as easily as UTF-8 ones.
Things I have tried (with no luck) after reading the whole file into a Vec<u8>:
CString
OsString
Indeed, in my company we have a lot of ANSI-encoded CSV files. Also, if possible, I would like not to load the entire file into a Vec<u8>, but to read it line by line (CRLF endings), as many of the files are big (50 MB or more…).
In the file Cargo.toml, I have this dependency:
[dependencies]
csv = "1"
test.csv consists of the following content, saved with Windows-1252 encoding:
Café;au;lait
Café;au;lait
The code in the main.rs file:
extern crate csv;

use std::error::Error;
use std::fs::File;
use std::io::BufReader;
use std::path::Path;
use std::process;

fn example() -> Result<(), Box<Error>> {
    let file_name = r"test.csv";
    let file_handle = File::open(Path::new(file_name))?;
    let reader = BufReader::new(file_handle);

    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(reader);

    // println!("ANSI");
    // for result in rdr.byte_records() {
    //     let record = result?;
    //     println!("{:?}", record);
    // }

    println!("UTF-8");
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }

    Ok(())
}

fn main() {
    if let Err(err) = example() {
        println!("error running example: {}", err);
        process::exit(1);
    }
}
The output is:
UTF-8
error running example: CSV parse error: record 0 (line 1, field: 0, byte: 0): invalid utf-8: invalid UTF-8 in field 0 near byte index 3
error: process didn't exit successfully: `target\debug\test-csv.exe` (exit code: 1)
When using rdr.byte_records() (uncommenting the relevant part of the code), the output is:
ANSI
ByteRecord(["Caf\xe9", "au", "lait"])
I suspect this question is underspecified. In particular, it's not clear why your use of the ByteRecord API is insufficient. In the csv crate, byte records exist for exactly this kind of case, where your CSV data isn't strictly UTF-8 but is in an alternative encoding, such as Windows-1252, that is ASCII compatible. (An ASCII-compatible encoding is an encoding of which ASCII is a subset. Windows-1252 and UTF-8 are both ASCII compatible; UTF-16 is not.) Your code sample above shows that you're using byte records, but it doesn't explain why that is insufficient.
With that said, if your goal is to get your data into Rust's string data type (String/&str), then your only option is to transcode the contents of your CSV data from Windows-1252 to UTF-8. This is necessary because Rust's string data type uses UTF-8 for its in-memory representation. You cannot have a Rust String/&str that is Windows-1252 encoded, because Windows-1252 is not a subset of UTF-8.
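To make the transcoding step concrete, here is a minimal sketch, assuming the encoding_rs crate recommended below (the helper name field_to_string is made up), that converts a single Windows-1252 field into a String:

use encoding_rs::WINDOWS_1252;

// Decode one Windows-1252 field into an owned, UTF-8 encoded String.
fn field_to_string(field: &[u8]) -> String {
    // decode() replaces undecodable sequences with U+FFFD, but every byte value
    // is defined in Windows-1252, so this conversion is lossless here.
    let (text, _encoding_used, _had_errors) = WINDOWS_1252.decode(field);
    text.into_owned()
}

// field_to_string(b"Caf\xE9") == "Café"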
Other comments have recommended the use of the encoding crate. However, I would instead recommend the use of encoding_rs, if your use case aligns with the use cases solved by the Encoding Standard, which is specifically geared towards the web. Fortunately, I believe such an alignment exists.
In order to satisfy your criterion of reading the CSV data in a streaming fashion, without first loading the entire contents into memory, you need to use a wrapper around the encoding_rs crate that implements streaming decoding for you. The encoding_rs_io crate provides this. (It's used inside of ripgrep to do fast streaming decoding before searching UTF-8.)
Here is an example program that puts all of the above together, using Rust 2018:
use std::fs::File;
use std::process;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() {
    if let Err(err) = try_main() {
        eprintln!("{}", err);
        process::exit(1);
    }
}

fn try_main() -> csv::Result<()> {
    let file = File::open("test.csv")?;
    let transcoded = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(file);
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(transcoded);
    for result in rdr.records() {
        let r = result?;
        println!("{:?}", r);
    }
    Ok(())
}
with the Cargo.toml:
[package]
name = "so53826986"
version = "0.1.0"
edition = "2018"
[dependencies]
csv = "1"
encoding_rs = "0.8.13"
encoding_rs_io = "0.1.3"
And the output:
$ cargo run --release
Compiling so53826986 v0.1.0 (/tmp/so53826986)
Finished release [optimized] target(s) in 0.63s
Running `target/release/so53826986`
StringRecord(["Café", "au", "lait"])
In particular, if you swap out rdr.records() for rdr.byte_records(), then we can see more clearly what happened:
$ cargo run --release
Compiling so53826986 v0.1.0 (/tmp/so53826986)
Finished release [optimized] target(s) in 0.61s
Running `target/release/so53826986`
ByteRecord(["Caf\xc3\xa9", "au", "lait"])
Namely, your input contained Caf\xE9, but the byte record now contains Caf\xC3\xA9. This is a result of translating the Windows-1252 codepoint value 233 (encoded as its literal byte, \xE9) to U+00E9 LATIN SMALL LETTER E WITH ACUTE, which is UTF-8 encoded as \xC3\xA9.
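For completeness, a small verification sketch (not from the original answer) that shows the same byte translation directly with encoding_rs:

use encoding_rs::WINDOWS_1252;

fn main() {
    // Windows-1252 byte 0xE9 decodes to U+00E9, whose UTF-8 encoding is 0xC3 0xA9.
    let (decoded, _, _) = WINDOWS_1252.decode(b"Caf\xE9");
    assert_eq!(decoded, "Café");
    assert_eq!(decoded.as_bytes(), b"Caf\xC3\xA9");
}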