I use Spark 2.1.
The input CSV file contains Unicode characters like the ones shown below.
While parsing this CSV file, the output comes out as shown below.
I use MS Excel 2010 to view the files.
The Java code used is:

import java.io.IOException;

import org.apache.spark.sql.SaveMode;
import org.junit.Test;

@Test
public void TestCSV() throws IOException {
    String inputPath = "/user/jpattnaik/1945/unicode.csv";
    String outputPath = "file:\\C:\\Users\\jpattnaik\\ubuntu-bkp\\backup\\bug-fixing\\1945\\output-csv";

    getSparkSession()                    // helper that returns the SparkSession
        .read()
        .option("inferSchema", "true")
        .option("header", "true")
        .option("encoding", "UTF-8")     // decode the input CSV as UTF-8
        .csv(inputPath)
        .write()
        .option("header", "true")
        .option("encoding", "UTF-8")     // write the output back out as UTF-8
        .mode(SaveMode.Overwrite)
        .csv(outputPath);
}
How can I get the output to be the same as the input?
I was able to read ISO-8859-1 files using Spark, but when I store the same data back to S3/HDFS and read it again, the data has been converted to UTF-8.

ex: é becomes é

val df = spark.read.format("csv")
  .option("delimiter", ",")
  .option("escape", "\"")
  .option("header", true)
  .option("encoding", "ISO-8859-1")
  .load("s3://bucket/folder")
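For reference, a minimal sketch of the same read in Java (matching the question's code style): the S3 path is a placeholder, and the SparkSession is assumed to be passed in. The "encoding" option tells the CSV reader which charset to use when decoding the bytes.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Read an ISO-8859-1 encoded CSV; the path below is a placeholder.
Dataset<Row> readLatin1Csv(SparkSession spark) {
    return spark.read()
        .format("csv")
        .option("delimiter", ",")
        .option("escape", "\"")
        .option("header", "true")
        .option("encoding", "ISO-8859-1")
        .load("s3://bucket/folder");
}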
My guess is that the input file is not in UTF-8, and hence you get the incorrect characters. My recommendation would be to write a pure Java application (with no Spark at all) and see if reading and writing gives the same results with UTF-8 encoding.
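A minimal sketch of such a check, with no Spark involved (the file paths are placeholders): read the file with an explicit charset, print it, and write it back out so the round-tripped copy can be compared byte-for-byte with the original.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class EncodingRoundTrip {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("unicode.csv");        // placeholder input path
        Path output = Paths.get("unicode-copy.csv");  // placeholder output path

        // Read the file assuming UTF-8; if the characters already look wrong here,
        // the file is most likely not UTF-8 (try StandardCharsets.ISO_8859_1 instead).
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        lines.forEach(System.out::println);

        // Write the lines back with the same encoding, then diff against the original.
        Files.write(output, lines, StandardCharsets.UTF_8);
    }
}

If this round trip preserves the characters, the problem is in the encoding the file was actually saved with (or in how it is viewed), not in Spark itself.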
.option('encoding', 'ISO-8859-1') worked for me. Acute, caret, and cedilla accents, among others, appeared correctly.