 

pyspark error: AttributeError: 'SparkSession' object has no attribute 'parallelize'


I am using PySpark in a Jupyter notebook. Here is how Spark is set up:

import findspark
findspark.init(spark_home='/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive',
               python_path='python2.7')

import pyspark
from pyspark.sql import *

sc = pyspark.sql.SparkSession.builder.master("yarn-client") \
    .config("spark.executor.memory", "2g") \
    .config('spark.driver.memory', '1g') \
    .config('spark.driver.cores', '4') \
    .enableHiveSupport().getOrCreate()

sqlContext = SQLContext(sc)

Then when I do:

spark_df = sqlContext.createDataFrame(df_in) 

where df_in is a pandas DataFrame, I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-1db231ce21c9> in <module>()
----> 1 spark_df = sqlContext.createDataFrame(df_in)

/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
    297         Py4JJavaError: ...
    298         """
--> 299         return self.sparkSession.createDataFrame(data, schema, samplingRatio)
    300
    301     @since(1.3)

/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in _createFromLocal(self, data, schema)
    400         # convert python objects to sql data
    401         data = [schema.toInternal(row) for row in data]
--> 402         return self._sc.parallelize(data), schema
    403
    404     @since(2.0)

AttributeError: 'SparkSession' object has no attribute 'parallelize'

Does anyone know what I did wrong? Thanks!

asked Sep 15 '16 by Edamame


People also ask

How do you create a SparkSession and SparkContext in PySpark?

In Spark or PySpark, a SparkSession object is created programmatically using SparkSession.builder(). If you are using the Spark shell, a SparkSession object named "spark" is created for you by default as an implicit object, and the SparkContext is retrieved from the session object via sparkSession.sparkContext.
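For illustration, a minimal PySpark sketch (the app name and local master are placeholders, not taken from the question) could look like this:

from pyspark.sql import SparkSession

# Build (or reuse) a session; in PySpark, builder is an attribute of SparkSession.
spark = SparkSession.builder \
    .appName("example") \
    .master("local[*]") \
    .getOrCreate()

# The underlying SparkContext is available as an attribute of the session.
sc = spark.sparkContext
print(sc.version)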

What is a SparkSession?

SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application. As a Spark developer, you create a SparkSession using the SparkSession.builder method (which gives you access to the Builder API that you use to configure the session).

What is getOrCreate in SparkSession?

getOrCreate() returns an already-running SparkSession if one exists; otherwise it creates a new one. The above is similar to creating a SparkContext with a local master and wrapping it in an SQLContext. If you need Hive support, you can enable it on the builder when creating the session, as in the sketch below.
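As a hedged PySpark sketch (the original snippet was Scala; the app name here is invented), a Hive-enabled session could be built like this:

from pyspark.sql import SparkSession

# getOrCreate() reuses an active session if one exists; otherwise it
# builds a new one with the options configured on the builder.
spark = SparkSession.builder \
    .appName("hive-example") \
    .enableHiveSupport() \
    .getOrCreate()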

What is the difference between SparkContext and SparkSession?

SparkSession vs SparkContext: since the earliest versions of Spark and PySpark, SparkContext (JavaSparkContext for Java) has been the entry point for programming with RDDs and for connecting to a Spark cluster. Since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets.
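To illustrate the split (the sample rows and column names below are invented): DataFrame creation goes through the SparkSession, while RDD operations such as parallelize go through the SparkContext obtained from it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame / Dataset entry point: the SparkSession itself.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# RDD entry point: the SparkContext retrieved from the session,
# which is where parallelize() actually lives.
rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.collect())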


1 Answer

SparkSession is not a replacement for a SparkContext but an equivalent of the SQLContext. Just use it the same way you used to use SQLContext:

spark.createDataFrame(...) 

and if you ever have to access the SparkContext, use the sparkContext attribute:

spark.sparkContext 

so if you need an SQLContext for backwards compatibility you can use:

SQLContext(sparkContext=spark.sparkContext, sparkSession=spark) 
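Putting this together with the setup from the question, a sketch of a corrected notebook cell might look like the following (the builder options are kept from the question; the small pandas DataFrame is a stand-in for df_in):

import pandas as pd
from pyspark.sql import SparkSession, SQLContext

# Stand-in for the questioner's pandas DataFrame.
df_in = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Build the session as before, but name the result spark, not sc:
# it is a SparkSession, not a SparkContext.
spark = SparkSession.builder \
    .master("yarn-client") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "1g") \
    .config("spark.driver.cores", "4") \
    .enableHiveSupport() \
    .getOrCreate()

# Create the Spark DataFrame directly from the session ...
spark_df = spark.createDataFrame(df_in)

# ... or, only if legacy code needs an SQLContext, wrap the session:
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
spark_df = sqlContext.createDataFrame(df_in)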
answered Sep 21 '22 by zero323