Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AttributeError: 'DataFrame' object has no attribute 'map'

I wanted to convert the spark data frame to add using the code below:

from pyspark.mllib.clustering import KMeans spark_df = sqlContext.createDataFrame(pandas_df) rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data])) model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random") 

The detailed error message is:

--------------------------------------------------------------------------- AttributeError                            Traceback (most recent call last) <ipython-input-11-a19a1763d3ac> in <module>()       1 from pyspark.mllib.clustering import KMeans       2 spark_df = sqlContext.createDataFrame(pandas_df) ----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))       4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")  /home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)     842         if name not in self.columns:     843             raise AttributeError( --> 844                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))     845         jc = self._jdf.apply(name)     846         return Column(jc)  AttributeError: 'DataFrame' object has no attribute 'map' 

Does anyone know what I did wrong here? Thanks!

like image 587
Edamame Avatar asked Sep 16 '16 15:09

Edamame


People also ask

What is RDD map?

RDD map() transformation is used to apply any complex operations like adding a column, updating a column, transforming the data e.t.c, the output of map transformations would always have the same number of records as input.

Can we convert DataFrame to RDD?

rdd is used to convert PySpark DataFrame to RDD; there are several transformations that are not available in DataFrame but present in RDD hence you often required to convert PySpark DataFrame to RDD. Since PySpark 1.3, it provides a property .

How do you convert PySpark DF to pandas DF?

Convert PySpark Dataframe to Pandas DataFramePySpark DataFrame provides a method toPandas() to convert it to Python Pandas DataFrame. toPandas() results in the collection of all records in the PySpark DataFrame to the driver program and should be done only on a small subset of the data.


1 Answers

You can't map a dataframe, but you can convert the dataframe to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map(). With Spark 2.0, you must explicitly call .rdd first.

like image 154
David Avatar answered Sep 19 '22 20:09

David