
pyspark error: 'DataFrame' object has no attribute 'map'

I am using pyspark 2.0 to create a DataFrame object by reading a csv using:

data = spark.read.csv('data.csv', header=True)

I check the type of data using:

type(data)

The result is

pyspark.sql.dataframe.DataFrame

I am trying to convert some columns in data to LabeledPoint in order to apply a classifier.

from pyspark.sql.types import *
from pyspark.sql.functions import col
from pyspark.mllib.regression import LabeledPoint

data.select(['label', 'features']).map(lambda row: LabeledPoint(row.label, row.features))

I ran into this error:

AttributeError: 'DataFrame' object has no attribute 'map'

Any idea what causes this error? Is there a way to generate LabeledPoint objects from a DataFrame in order to perform classification?

Xi Liang asked Sep 08 '16 01:09





1 Answer

Use .rdd.map:

>>> data.select(...).rdd.map(...)

DataFrame.map has been removed in Spark 2.
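
As a rough sketch of how this could look for the LabeledPoint case in the question: spark.read.csv without a schema or inferSchema=True returns every column as a string, so the values likely need converting to numbers first. The helper names parse_row and to_labeled_points are made up for illustration, and the layout of the features column (numbers separated by whitespace) is only an assumption, since the question does not show the CSV.

```python
from collections import namedtuple


def parse_row(row, sep=None):
    """Turn one row into (label, features) with numeric values.

    Assumes row.label is numeric text and row.features is a string of
    numbers separated by `sep` (whitespace when sep is None). Both are
    assumptions about the CSV layout, which the question does not show.
    """
    label = float(row.label)
    features = [float(x) for x in str(row.features).split(sep)]
    return label, features


def to_labeled_points(df):
    """Map a DataFrame to an RDD of LabeledPoint via DataFrame.rdd."""
    # Imported lazily so parse_row can be tried without a Spark install.
    from pyspark.mllib.regression import LabeledPoint

    def convert(row):
        label, features = parse_row(row)
        return LabeledPoint(label, features)

    # DataFrame.rdd exposes the rows as an RDD of Row objects,
    # and the RDD still has map() in Spark 2.
    return df.select("label", "features").rdd.map(convert)


# Quick check of the parsing step with a stand-in for a Spark Row.
FakeRow = namedtuple("FakeRow", ["label", "features"])
example = parse_row(FakeRow(label="1", features="0.5 2.0"))
```

With the DataFrame from the question, to_labeled_points(data) would then return an RDD that the pyspark.mllib classifiers accept, provided the columns really have the assumed layout.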

user6022341 answered Sep 23 '22 16:09
