 

pyspark error: AttributeError: 'SparkSession' object has no attribute 'parallelize'


I am using PySpark in a Jupyter notebook. Here is how Spark is set up:

import findspark
findspark.init(spark_home='/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive',
               python_path='python2.7')

import pyspark
from pyspark.sql import *

sc = pyspark.sql.SparkSession.builder.master("yarn-client") \
    .config("spark.executor.memory", "2g") \
    .config('spark.driver.memory', '1g') \
    .config('spark.driver.cores', '4') \
    .enableHiveSupport().getOrCreate()

sqlContext = SQLContext(sc)

Then when I do:

spark_df = sqlContext.createDataFrame(df_in) 

where df_in is a pandas DataFrame, I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-1db231ce21c9> in <module>()
----> 1 spark_df = sqlContext.createDataFrame(df_in)

/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
    297         Py4JJavaError: ...
    298         """
--> 299         return self.sparkSession.createDataFrame(data, schema, samplingRatio)
    300
    301     @since(1.3)

/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in _createFromLocal(self, data, schema)
    400         # convert python objects to sql data
    401         data = [schema.toInternal(row) for row in data]
--> 402         return self._sc.parallelize(data), schema
    403
    404     @since(2.0)

AttributeError: 'SparkSession' object has no attribute 'parallelize'

Does anyone know what I did wrong? Thanks!

asked Sep 15 '16 by Edamame


People also ask

How do you create a SparkSession and SparkContext in PySpark?

In Spark or PySpark, a SparkSession object is created programmatically using SparkSession.builder(). If you are using the Spark shell, a SparkSession object named "spark" is created for you by default as an implicit object, and the SparkContext is retrieved from the session object via sparkSession.sparkContext.
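For illustration, a minimal PySpark sketch (the app name and local master are placeholders, not taken from the question) could look like this:

from pyspark.sql import SparkSession

# Build (or reuse) a session; in PySpark, builder is an attribute of SparkSession.
spark = SparkSession.builder \
    .appName("example") \
    .master("local[*]") \
    .getOrCreate()

# The underlying SparkContext is available as an attribute of the session.
sc = spark.sparkContext
print(sc.version)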

What is a SparkSession?

SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application. As a Spark developer, you create a SparkSession using the SparkSession.builder method (which gives you access to the Builder API that you use to configure the session).

What is getOrCreate in SparkSession?

getOrCreate() returns an already-running SparkSession if one exists; otherwise it creates a new one. The above is similar to creating a SparkContext with a local master and wrapping it in an SQLContext. If you need Hive support, you can enable it on the builder when creating the session, as in the sketch below.
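As a hedged PySpark sketch (the original snippet was Scala; the app name here is invented), a Hive-enabled session could be built like this:

from pyspark.sql import SparkSession

# getOrCreate() reuses an active session if one exists; otherwise it
# builds a new one with the options configured on the builder.
spark = SparkSession.builder \
    .appName("hive-example") \
    .enableHiveSupport() \
    .getOrCreate()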

What is the difference between SparkContext and SparkSession?

SparkSession vs SparkContext: since the earliest versions of Spark and PySpark, SparkContext (JavaSparkContext for Java) has been the entry point for programming with RDDs and for connecting to a Spark cluster. Since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets.
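To illustrate the split (the sample rows and column names below are invented): DataFrame creation goes through the SparkSession, while RDD operations such as parallelize go through the SparkContext obtained from it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame / Dataset entry point: the SparkSession itself.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# RDD entry point: the SparkContext retrieved from the session,
# which is where parallelize() actually lives.
rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.collect())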


1 Answer

SparkSession is not a replacement for a SparkContext but an equivalent of the SQLContext. Just use it the same way you used to use SQLContext:

spark.createDataFrame(...) 

and if you ever have to access the SparkContext, use the sparkContext attribute:

spark.sparkContext 

so if you need an SQLContext for backwards compatibility you can use:

SQLContext(sparkContext=spark.sparkContext, sparkSession=spark) 
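Putting this together with the setup from the question, a sketch of a corrected notebook cell might look like the following (the builder options are kept from the question; the small pandas DataFrame is a stand-in for df_in):

import pandas as pd
from pyspark.sql import SparkSession, SQLContext

# Stand-in for the questioner's pandas DataFrame.
df_in = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Build the session as before, but name the result spark, not sc:
# it is a SparkSession, not a SparkContext.
spark = SparkSession.builder \
    .master("yarn-client") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "1g") \
    .config("spark.driver.cores", "4") \
    .enableHiveSupport() \
    .getOrCreate()

# Create the Spark DataFrame directly from the session ...
spark_df = spark.createDataFrame(df_in)

# ... or, only if legacy code needs an SQLContext, wrap the session:
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
spark_df = sqlContext.createDataFrame(df_in)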
answered Sep 21 '22 by zero323