
Why can't we create an RDD using Spark session

When starting the Spark shell, we see:

Spark context available as 'sc'.
Spark session available as 'spark'.

I have read that a Spark session includes the Spark context, streaming context, Hive context ... If so, why are we not able to create an RDD using a Spark session instead of a Spark context?

scala> val a = sc.textFile("Sample.txt")
17/02/17 16:16:14 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
a: org.apache.spark.rdd.RDD[String] = Sample.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> val a = spark.textFile("Sample.txt")
<console>:23: error: value textFile is not a member of org.apache.spark.sql.SparkSession
       val a = spark.textFile("Sample.txt")

As shown above, sc.textFile succeeds in creating an RDD, but spark.textFile does not.

asked Feb 17 '17 by Sudha

People also ask

Can we create an RDD with a Spark session?

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat. Text file RDDs can be created using SparkContext's textFile method.

What are the three ways we can create an RDD?

There are three ways to create an RDD in Spark: by parallelizing an existing collection in the driver program; by referencing a dataset in an external storage system (e.g. HDFS, HBase, a shared file system); or by transforming an already existing RDD.
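A minimal sketch of those three approaches, run in the same Spark shell as the question (assuming sc is the SparkContext and Sample.txt exists in the working directory):

scala> // 1. Parallelize an existing collection in the driver program
scala> val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

scala> // 2. Reference a dataset in external storage (local file, HDFS, S3, ...)
scala> val fromFile = sc.textFile("Sample.txt")

scala> // 3. Transform an already existing RDD into a new one
scala> val words = fromFile.flatMap(line => line.split(" "))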

What are the limitations of RDDs?

There are some drawbacks of using RDDs though: RDD code can sometimes be very opaque. Developers might struggle to find out what exactly the code is trying to compute. RDDs cannot be optimized by Spark, as Spark cannot look inside the lambda functions and optimize the operations.
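To illustrate the optimization point, here is a small, hypothetical comparison in the shell (Person, people, adultsRdd and adultsDf are made up for this example; the DataFrame syntax assumes import spark.implicits._):

scala> import spark.implicits._
scala> case class Person(name: String, age: Int)
scala> val people = Seq(Person("ann", 30), Person("bob", 15))

scala> // RDD version: the filter lambda is a black box, so Spark just runs it as-is
scala> val adultsRdd = sc.parallelize(people).filter(p => p.age > 21)

scala> // DataFrame version: the column expression is visible to the Catalyst optimizer
scala> val adultsDf = people.toDF().filter($"age" > 21)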

How many Spark sessions can be created?

A Spark session should be created only once per Spark application. Spark doesn't support multiple sessions in the same job, and your job may fail if you use more than one Spark session in it. See SPARK-2243, the ticket Spark closed as won't-fix.
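For a standalone application (outside the shell), the usual pattern is to build the session once with getOrCreate(), which returns the already active session instead of creating a second one; the app name and master below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApp")      // placeholder application name
  .master("local[*]")    // placeholder master URL
  .getOrCreate()         // reuses the active session if one already exists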


1 Answer

In earlier versions of Spark, the Spark context was the entry point for Spark. As RDD was the main API, RDDs were created and manipulated using the context's APIs.

For every other API, we needed a different context: StreamingContext for streaming, SQLContext for SQL, and HiveContext for Hive.

But as the Dataset and DataFrame APIs became the new standard, Spark needed an entry point built for them. So in Spark 2.0, Spark introduced a new entry point for the Dataset and DataFrame APIs, called SparkSession.

SparkSession is essentially a combination of SQLContext, HiveContext and, in the future, StreamingContext.

All the APIs available on those contexts are also available on the Spark session. Internally, the Spark session holds a Spark context for the actual computation.

That sparkContext still has the methods it had in previous versions, and you can reach it from the session as spark.sparkContext.
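So the RDD from the question can be created through the session by going through its underlying context:

scala> val a = spark.sparkContext.textFile("Sample.txt")

This returns the same org.apache.spark.rdd.RDD[String] as the sc.textFile call shown in the question.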

The methods of SparkSession can be found in the Spark API documentation.

answered Nov 14 '22 by bob