I am only trying to read a text file into a PySpark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.
s3_single_file_inpath='s3a://bucket-name/file_name'
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false',sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
The sqlContext.read.load command above fails with
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
But the second one succeeds?
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load, including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
It is not clear to me when to use which of these. Is there a clear distinction between them?
Why is there a difference between sqlContext.read.load and sqlContext.read.text?
sqlContext.read.load assumes parquet as the data source format while sqlContext.read.text assumes text format. With sqlContext.read.load you can define the data source format using the format parameter.
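As a rough illustration of that difference, using the path from the question (the json call is just an example of another built-in format and is not from the original post):

# read.load with no format option falls back to the default data source, parquet
df_default = sqlContext.read.load('s3a://bucket-name/file_name')
# the format parameter switches the data source, e.g. the built-in json source
df_json = sqlContext.read.load('s3a://bucket-name/file_name', format='json')
# read.text always reads plain text: one row per line, in a single column named 'value'
df_text = sqlContext.read.text('s3a://bucket-name/file_name')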
Depending on the Spark version (1.6.x vs 2.x) you may or may not need to load an external Spark package to have support for the csv format.
As of Spark 2.0 you no longer have to load the spark-csv Spark package since (quoting the official documentation):
NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.
That would explain why you got confused, as you may have been using Spark 1.6.x and had not loaded the Spark package to get csv support.
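So on Spark 2.x a sketch like the following should work with no extra packages (the option values mirror the ones in the question):

# Spark 2.x: csv is a built-in data source, no spark-csv package needed
indata = sqlContext.read.load('s3a://bucket-name/file_name',
                              format='csv',
                              header='true',
                              inferSchema='false',
                              sep=',')
# equivalent shorthand
indata = sqlContext.read.csv('s3a://bucket-name/file_name',
                             header='true', inferSchema='false', sep=',')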
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load, including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1, when the spark-csv Spark package was not part of Spark. That only changed in Spark 2.0.
It is not clear to me when to use which of these. Is there a clear distinction between them?
There's none, actually, if you use Spark 2.x.
If, however, you use Spark 1.6.x, spark-csv has to be loaded separately using the --packages option (as described in Using with Spark shell):

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
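As a sketch, assuming Spark 1.6.x and the package coordinates quoted further down, the whole flow would look roughly like this (the shell invocation is shown as a comment; note that the spark-csv package names its delimiter option delimiter rather than sep):

# Start the shell with the external package first (Spark 1.6.x), e.g.:
#   pyspark --packages com.databricks:spark-csv_2.10:1.5.0
# Then the load from the question can resolve com.databricks.spark.csv:
indata = sqlContext.read.load('s3a://bucket-name/file_name',
                              format='com.databricks.spark.csv',
                              header='true',
                              inferSchema='false',
                              delimiter=',')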
As a matter of fact, you can still use the com.databricks.spark.csv format explicitly in Spark 2.x, as it's recognized internally.
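For example, a sketch like this should still resolve on Spark 2.x, where the legacy name is mapped to the built-in csv source:

# Spark 2.x maps the legacy com.databricks.spark.csv name to the built-in csv source
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='false') \
    .load('s3a://bucket-name/file_name')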
The difference is:

- text is a built-in input format in Spark 1.6
- com.databricks.spark.csv is a third party package in Spark 1.6

To use the third party Spark CSV package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example provide the
--packages com.databricks:spark-csv_2.10:1.5.0
argument with the spark-submit / pyspark commands.
Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName") and sqlContext.read.load(..., format=formatName).
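As a sketch of that equivalence, using the built-in text source so it applies to both 1.6 and 2.x:

path = 's3a://bucket-name/file_name'
# the three calls below are equivalent ways of invoking the same data source
df1 = sqlContext.read.text(path)
df2 = sqlContext.read.format('text').load(path)
df3 = sqlContext.read.load(path, format='text')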