Spark - csv read option

Tags:

apache-spark

I'm using Spark 2.1 and am trying to read a CSV file.

compile group: 'org.scala-lang', name: 'scala-library', version: '2.11.1'
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.1.0'
// spark-sql is also needed for SparkSession and the DataFrame API used below
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.1.0'

Here is my code.

import org.apache.spark.sql.SparkSession

// Build the SparkSession that provides the DataFrameReader (spark.read).
val spark = SparkSession.builder()
    .appName("csv-read")
    .master("local[*]")
    .getOrCreate()

val df = spark.read
    .option("charset", "utf-8")
    .option("header", "true")
    .option("quote", "\"")
    .option("delimiter", ",")
    .csv(...)

It works well. The problem is that the DataFrameReader option key does not match the reference (link). The reference says I should use 'encoding' to set the encoding, but that does not work, while 'charset' works fine. Is the reference wrong?

asked Jul 21 '17 by J.Done


1 Answer

You can see in the Spark source, where the CSV options are parsed:

val charset = parameters.getOrElse("encoding",
    parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))
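
Given that fallback chain, encoding is checked first, then charset, then the UTF-8 default. As a minimal sketch (the path is hypothetical), if you set both keys, the encoding value wins:

val df = spark.read
    .option("charset", "iso-8859-1")  // legacy fallback, ignored here
    .option("encoding", "utf-8")      // checked first, so this value is used
    .option("header", "true")
    .csv("/path/to/file.csv")         // hypothetical path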

Both encoding and charset are valid options, and you should have no problem using either one to set the encoding.
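
For example, these two reads should behave identically (a sketch, assuming a hypothetical path):

// Documented option name.
val viaEncoding = spark.read
    .option("encoding", "utf-8")
    .option("header", "true")
    .csv("/path/to/file.csv")

// Legacy alias, resolved to the same setting.
val viaCharset = spark.read
    .option("charset", "utf-8")
    .option("header", "true")
    .csv("/path/to/file.csv")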

charset is simply there for legacy support, dating back to when Spark's CSV code came from the Databricks spark-csv project, which was merged into Spark as of 2.x. That is also where delimiter (now sep) comes from.
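
The aliasing works the same way there; a sketch using the current name (the file path is hypothetical):

// "sep" is the current option name for the field separator;
// "delimiter" is still accepted as a legacy alias.
val piped = spark.read
    .option("sep", "|")
    .option("header", "true")
    .csv("/path/to/pipe_separated.csv")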

Also note the default values for the CSV reader: you can drop charset, quote, and delimiter from your code, since you are only setting them to their defaults. That leaves you with simply:

spark.read.option("header", "true").csv(...)
answered Nov 06 '22 by soote