I'm using Spark 2.1 and am trying to read a CSV file.
compile group: 'org.scala-lang', name: 'scala-library', version: '2.11.1'
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.1.0'
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.1.0' // provides org.apache.spark.sql.SparkSession used below
Here is my code.
import org.apache.spark.sql.SparkSession

// Build the session; appName/master here are just local-run examples.
val spark = SparkSession.builder()
  .appName("csv-read")
  .master("local[*]")
  .getOrCreate()
spark.read
.option("charset", "utf-8")
.option("header", "true")
.option("quote", "\"")
.option("delimiter", ",")
.csv(...)
It works well. The problem is that the Spark DataFrameReader option keys don't match the reference (link). The reference says I should use 'encoding' to set the encoding, but that option doesn't work, while 'charset' works fine. Is the reference wrong?
Apache PySpark provides csv("path") on the DataFrameReader for reading a CSV file into a Spark DataFrame, and dataframeObj.write.csv("path") for saving or writing a DataFrame to a CSV file. PySpark supports reading files with pipe, comma, tab, and other delimiters/separators.
Reading multiple CSV files into an RDD: Spark RDDs don't have a method for reading CSV files directly, so we use textFile() to read the CSV like any other text file and split each record on the comma, pipe, or other delimiter, as sketched below.
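A minimal Scala sketch of that textFile() approach; the path data/people.csv and the comma delimiter are just assumptions for illustration:

import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext `sc` (e.g. spark.sparkContext).
// Each element of the RDD is one raw line of the file.
val lines: RDD[String] = sc.textFile("data/people.csv")

// Split each record on the delimiter; swap in "|" or "\t" as appropriate.
val fields: RDD[Array[String]] = lines.map(_.split(","))

Note this gives you plain arrays of strings with no header handling or type inference; that is exactly what the DataFrame csv("path") reader adds on top.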
You can see it here, in Spark's CSV option parsing:
val charset = parameters.getOrElse("encoding",
  parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))
Both encoding and charset are valid options, and you should have no problem using either when setting the encoding.
charset is simply there for legacy support, from when the Spark CSV code was the Databricks spark-csv project, which was merged into Spark as of 2.x. That is also where delimiter (now sep) comes from.
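To illustrate, here is a sketch showing the two sets of option names side by side (the path is hypothetical); both reads should behave identically:

// Using the documented option names...
val df1 = spark.read
  .option("encoding", "utf-8")
  .option("sep", ",")
  .option("header", "true")
  .csv("data/people.csv")

// ...and the legacy spark-csv names, which are still accepted as aliases.
val df2 = spark.read
  .option("charset", "utf-8")
  .option("delimiter", ",")
  .option("header", "true")
  .csv("data/people.csv")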
Note the default values for the CSV reader: charset, quote, and delimiter in your code are all just the defaults, so you can remove them. That leaves you with simply:
spark.read.option("header", "true").csv(...)
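If you want to confirm the options took effect, a quick check (path again hypothetical):

val df = spark.read.option("header", "true").csv("data/people.csv")
df.printSchema() // column names should come from the header row
df.show(5)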