
How to parse CSV file with UTF-8 encoding?

I use Spark 2.1.

The input CSV file contains Unicode characters.


When I parse this CSV file and write it back out, those characters no longer match the input.


I use MS Excel 2010 to view files.

The Java code I use is:

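public void TestCSV() throws IOException {
    String inputPath = "/user/jpattnaik/1945/unicode.csv";
    String outputPath = "file:\\C:\\Users\\jpattnaik\\ubuntu-bkp\\backup\\bug-fixing\\1945\\output-csv";
    // Assumption: `spark` is an existing SparkSession; the start of this call chain was lost from the post.
    spark.read()
      .option("inferSchema", "true")
      .option("header", "true")
      .option("encoding", "UTF-8")
      .csv(inputPath)
      // Write the parsed rows back out with the same header and encoding options.
      .write()
      .option("header", "true")
      .option("encoding", "UTF-8")
      .csv(outputPath);
}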

How can I get the output to match the input?

Jyoti Ranjan asked May 16 '17 13:05

3 Answers

I was able to read ISO-8859-1 data with Spark, but when I write the same data back to S3/HDFS and read it again, it has been converted to UTF-8.

Example: é becomes Ã©.

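// "quote" sets the CSV quote character; the original option name "ESCAPE quote" is not a Spark CSV option.
val df = spark.read.format("csv").option("delimiter", ",").option("quote", "\"").option("header", true).option("encoding", "ISO-8859-1").load("s3://bucket/folder")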
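
The kind of corruption described above can be reproduced without Spark. The following is a minimal Java sketch, not taken from the answer: it encodes é as UTF-8 and then decodes those bytes as ISO-8859-1, which is what happens when a UTF-8 file is read with the wrong charset. The class and method names (`MojibakeDemo`, `utf8ReadAsLatin1`) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes as if they were ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00e9" is é; the escape avoids depending on the source file's own encoding.
        String garbled = utf8ReadAsLatin1("\u00e9");
        // Print code points instead of glyphs so the console encoding cannot interfere.
        garbled.chars().forEach(c -> System.out.printf("U+%04X%n", c));
        // Prints U+00C3 and U+00A9, i.e. the two characters Ã and ©.
    }
}
```

One character turning into two is the usual sign that the bytes are UTF-8 but are being decoded as a single-byte charset somewhere along the way.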
Saida answered Oct 13 '22 02:10


My guess is that the input file is not in UTF-8 and hence you get the incorrect characters.
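
One way to test that guess is to decode the file's bytes strictly as UTF-8 and see whether the decoder rejects them. This is only a sketch under assumptions: the class name `Utf8Check` and the idea of passing the CSV path as the first command-line argument are illustrative, not part of the question.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java Utf8Check path/to/unicode.csv
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "not valid UTF-8");
    }
}
```

A `false` result shows the file is not UTF-8. A `true` result is weaker evidence, since a pure-ASCII file also passes.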

My recommendation would be to write a pure Java application (with no Spark at all) and see if reading and writing gives the same results with UTF-8 encoding.
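
A minimal sketch of such a Spark-free check might look like the following. The class name `Utf8RoundTrip` and the choice to compare raw bytes are assumptions made for illustration; the input and output paths are passed as command-line arguments.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Read the whole file as UTF-8 text and write it back out as UTF-8.
    // Bytes that are not valid UTF-8 are replaced with U+FFFD while decoding.
    static void copyAsUtf8(Path in, Path out) throws IOException {
        String content = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        Files.write(out, content.getBytes(StandardCharsets.UTF_8));
    }

    // Compare two files byte for byte.
    static boolean sameBytes(Path a, Path b) throws IOException {
        return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java Utf8RoundTrip input.csv output.csv
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        copyAsUtf8(in, out);
        System.out.println(sameBytes(in, out)
            ? "round trip preserved the file"
            : "round trip changed the file; the input is probably not UTF-8");
    }
}
```

If the copy differs from the input, the problem lies in the file's encoding rather than in Spark.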

Jacek Laskowski answered Oct 13 '22 02:10

.option('encoding', 'ISO-8859-1') worked for me. Acute, caret, cedilla accents among others appeared correctly.

Diogo Féria answered Oct 13 '22 01:10