I use Spark 2.1.
The input CSV file contains Unicode characters like the ones shown below.
While parsing this CSV file, the output comes out as shown below.
I use MS Excel 2010 to view the files.
The Java code used is:

import java.io.IOException;

import org.apache.spark.sql.SaveMode;
import org.junit.Test;

@Test
public void TestCSV() throws IOException {
    String inputPath = "/user/jpattnaik/1945/unicode.csv";
    String outputPath = "file:\\C:\\Users\\jpattnaik\\ubuntu-bkp\\backup\\bug-fixing\\1945\\output-csv";

    getSparkSession()                    // helper that returns the SparkSession
        .read()
        .option("inferSchema", "true")
        .option("header", "true")
        .option("encoding", "UTF-8")     // decode the input CSV as UTF-8
        .csv(inputPath)
        .write()
        .option("header", "true")
        .option("encoding", "UTF-8")     // write the output back out as UTF-8
        .mode(SaveMode.Overwrite)
        .csv(outputPath);
}
How can I get the output to be the same as the input?
I was able to read ISO-8859-1 files using Spark, but when I store the same data back to S3/HDFS and read it again, the data has been converted to UTF-8.

ex: é becomes é

val df = spark.read.format("csv")
  .option("delimiter", ",")
  .option("escape", "\"")
  .option("header", true)
  .option("encoding", "ISO-8859-1")
  .load("s3://bucket/folder")
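For reference, a minimal sketch of the same read in Java (matching the question's code style): the S3 path is a placeholder, and the SparkSession is assumed to be passed in. The "encoding" option tells the CSV reader which charset to use when decoding the bytes.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Read an ISO-8859-1 encoded CSV; the path below is a placeholder.
Dataset<Row> readLatin1Csv(SparkSession spark) {
    return spark.read()
        .format("csv")
        .option("delimiter", ",")
        .option("escape", "\"")
        .option("header", "true")
        .option("encoding", "ISO-8859-1")
        .load("s3://bucket/folder");
}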
My guess is that the input file is not in UTF-8, and hence you get the incorrect characters. My recommendation would be to write a pure Java application (with no Spark at all) and see if reading and writing gives the same results with UTF-8 encoding.
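A minimal sketch of such a check, with no Spark involved (the file paths are placeholders): read the file with an explicit charset, print it, and write it back out so the round-tripped copy can be compared byte-for-byte with the original.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class EncodingRoundTrip {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("unicode.csv");        // placeholder input path
        Path output = Paths.get("unicode-copy.csv");  // placeholder output path

        // Read the file assuming UTF-8; if the characters already look wrong here,
        // the file is most likely not UTF-8 (try StandardCharsets.ISO_8859_1 instead).
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        lines.forEach(System.out::println);

        // Write the lines back with the same encoding, then diff against the original.
        Files.write(output, lines, StandardCharsets.UTF_8);
    }
}

If this round trip preserves the characters, the problem is in the encoding the file was actually saved with (or in how it is viewed), not in Spark itself.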
.option('encoding', 'ISO-8859-1') worked for me. Acute, caret, and cedilla accents, among others, appeared correctly.