I wanted to convert the Spark data frame to an RDD using the code below:
from pyspark.mllib.clustering import KMeans

spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")
The detailed error message is:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
      1 from pyspark.mllib.clustering import KMeans
      2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
      4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
    842         if name not in self.columns:
    843             raise AttributeError(
--> 844                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
    845         jc = self._jdf.apply(name)
    846         return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'map'
Does anyone know what I did wrong here? Thanks!
The RDD map() transformation is used to apply complex operations such as adding a column, updating a column, or otherwise transforming the data; the output of a map transformation always has the same number of records as its input.
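For example, a minimal sketch (the SparkContext sc and the sample values are assumptions for illustration):

# map() produces exactly one output element per input element
numbers = sc.parallelize([1, 2, 3, 4])
squared = numbers.map(lambda x: x * x)
print(squared.collect())  # [1, 4, 9, 16] -- same number of records as the input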
The rdd property is used to convert a PySpark DataFrame to an RDD; several transformations that are not available on DataFrame exist only on RDD, so you often need to convert a PySpark DataFrame to an RDD. Since PySpark 1.3, DataFrame has provided the .rdd property for this.
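For instance, a minimal sketch (spark_df stands for any existing DataFrame):

# .rdd exposes the DataFrame's contents as an RDD of Row objects
row_rdd = spark_df.rdd
# each Row can then be transformed with RDD operations such as map()
values_rdd = row_rdd.map(lambda row: [float(c) for c in row])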
Convert PySpark DataFrame to pandas DataFrame

PySpark DataFrame provides a toPandas() method to convert it to a pandas DataFrame. toPandas() collects all records of the PySpark DataFrame to the driver program, so it should be done only on a small subset of the data.
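A minimal sketch (again assuming spark_df is small enough to fit on the driver):

# toPandas() pulls every row to the driver, so use it only on small data
pandas_copy = spark_df.toPandas()
print(pandas_copy.head())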
You can't map a DataFrame, but you can convert the DataFrame to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map(). With Spark 2.0, you must explicitly call .rdd first.
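Applied to the snippet in the question, that looks roughly like this (a sketch; it also adds the Vectors import the original code needs once the map succeeds):

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

spark_df = sqlContext.createDataFrame(pandas_df)
# convert to an RDD first, then map each Row to a dense vector
rdd = spark_df.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")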