 

How to parse CSV file with UTF-8 encoding?

I use Spark 2.1.

The input CSV file contains Unicode characters, as shown below:

[screenshot: unicode-input-csv]

After parsing this CSV file and writing it back out, the output looks like this:

[screenshot: unicode-output-csv]

I use MS Excel 2010 to view the files.

The Java code used is

import java.io.IOException;
import org.apache.spark.sql.SaveMode;
import org.junit.Test;

@Test
public void TestCSV() throws IOException {
    String inputPath = "/user/jpattnaik/1945/unicode.csv";
    String outputPath = "file:\\C:\\Users\\jpattnaik\\ubuntu-bkp\\backup\\bug-fixing\\1945\\output-csv";
    getSparkSession()
      .read()
      .option("inferSchema", "true")
      .option("header", "true")
      .option("encoding", "UTF-8")   // decode the input as UTF-8
      .csv(inputPath)
      .write()
      .option("header", "true")
      .option("encoding", "UTF-8")   // encode the output as UTF-8
      .mode(SaveMode.Overwrite)
      .csv(outputPath);
}

How can I get the output to match the input?

asked May 16 '17 by Jyoti Ranjan


3 Answers

I was able to read ISO-8859-1 data with Spark, but when I store the same data back to S3/HDFS and read it again, it comes back as UTF-8 and the accented characters are garbled.

For example, é becomes é (the bytes of é decoded with the wrong encoding).

I read the file like this:

val df = spark.read
  .format("csv")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", true)
  .option("encoding", "ISO-8859-1")
  .load("s3://bucket/folder")
answered Oct 13 '22 by Saida


My guess is that the input file is not actually in UTF-8, and hence you get incorrect characters.

My recommendation would be to write a pure Java application (with no Spark at all) and see if reading and writing gives the same results with UTF-8 encoding.
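
For example, a minimal sketch of that round trip in plain Java (no Spark). The file paths are placeholders; if the input file really is UTF-8, the output should match it, and if it is not, the read will throw or produce garbled characters:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class Utf8RoundTrip {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("unicode.csv");       // placeholder path
        Path output = Paths.get("unicode-out.csv");  // placeholder path

        // Decode explicitly as UTF-8; a file in a different encoding
        // will throw a MalformedInputException here or come out garbled.
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);

        // Re-encode as UTF-8, then compare the two files.
        Files.write(output, lines, StandardCharsets.UTF_8);
    }
}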

answered Oct 13 '22 by Jacek Laskowski


Setting .option('encoding', 'ISO-8859-1') worked for me. Acute, circumflex, and cedilla accents, among others, appeared correctly.
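
Applied to the Java code in the question, the read would look something like this (a sketch; getSparkSession() and inputPath are the question's own names):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read with the encoding that matches the file's actual bytes,
// ISO-8859-1 here instead of UTF-8.
Dataset<Row> df = getSparkSession()
    .read()
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "ISO-8859-1")
    .csv(inputPath);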

answered Oct 13 '22 by Diogo Féria