I am only trying to read a text file into a PySpark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.
s3_single_file_inpath='s3a://bucket-name/file_name'
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false',sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
The sqlContext.read.load command above fails with
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
But the second one succeeds?
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load, including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
It is not clear to me when to use which of these. Is there a clear distinction between them?
Why is there a difference between sqlContext.read.load and sqlContext.read.text?
sqlContext.read.load assumes parquet as the data source format while sqlContext.read.text assumes text format. With sqlContext.read.load you can define the data source format using the format parameter.
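As a rough illustration of that difference, using the path from the question (the json call is just an example of another built-in format and is not from the original post):

# read.load with no format option falls back to the default data source, parquet
df_default = sqlContext.read.load('s3a://bucket-name/file_name')
# the format parameter switches the data source, e.g. the built-in json source
df_json = sqlContext.read.load('s3a://bucket-name/file_name', format='json')
# read.text always reads plain text: one row per line, in a single column named 'value'
df_text = sqlContext.read.text('s3a://bucket-name/file_name')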
Depending on the Spark version (1.6.x vs 2.x) you may or may not need to load an external Spark package to have support for the csv format.
As of Spark 2.0 you no longer have to load the spark-csv Spark package since (quoting the official documentation):
NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.
That would explain why you got confused, as you may have been using Spark 1.6.x and had not loaded the Spark package to get csv support.
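So on Spark 2.x a sketch like the following should work with no extra packages (the option values mirror the ones in the question):

# Spark 2.x: csv is a built-in data source, no spark-csv package needed
indata = sqlContext.read.load('s3a://bucket-name/file_name',
                              format='csv',
                              header='true',
                              inferSchema='false',
                              sep=',')
# equivalent shorthand
indata = sqlContext.read.csv('s3a://bucket-name/file_name',
                             header='true', inferSchema='false', sep=',')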
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load, including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1, when the spark-csv Spark package was not part of Spark. That only changed in Spark 2.0.
It is not clear to me when to use which of these. Is there a clear distinction between them?
There's none, actually, if you use Spark 2.x.
If, however, you use Spark 1.6.x, spark-csv has to be loaded separately using the --packages option (as described in Using with Spark shell):

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
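As a sketch, assuming Spark 1.6.x and the package coordinates quoted further down, the whole flow would look roughly like this (the shell invocation is shown as a comment; note that the spark-csv package names its delimiter option delimiter rather than sep):

# Start the shell with the external package first (Spark 1.6.x), e.g.:
#   pyspark --packages com.databricks:spark-csv_2.10:1.5.0
# Then the load from the question can resolve com.databricks.spark.csv:
indata = sqlContext.read.load('s3a://bucket-name/file_name',
                              format='com.databricks.spark.csv',
                              header='true',
                              inferSchema='false',
                              delimiter=',')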
As a matter of fact, you can still use the com.databricks.spark.csv format explicitly in Spark 2.x, as it's recognized internally.
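For example, a sketch like this should still resolve on Spark 2.x, where the legacy name is mapped to the built-in csv source:

# Spark 2.x maps the legacy com.databricks.spark.csv name to the built-in csv source
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='false') \
    .load('s3a://bucket-name/file_name')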
The difference is:

- text is a built-in input format in Spark 1.6
- com.databricks.spark.csv is a third party package in Spark 1.6

To use the third party Spark CSV package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example provide the
--packages com.databricks:spark-csv_2.10:1.5.0
argument with the spark-submit / pyspark commands.
Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName") and sqlContext.read.load(..., format=formatName).
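As a sketch of that equivalence, using the built-in text source so it applies to both 1.6 and 2.x:

path = 's3a://bucket-name/file_name'
# the three calls below are equivalent ways of invoking the same data source
df1 = sqlContext.read.text(path)
df2 = sqlContext.read.format('text').load(path)
df3 = sqlContext.read.load(path, format='text')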