I am using pyspark 2.0 to create a DataFrame object by reading a csv using:
data = spark.read.csv('data.csv', header=True)
I find the type of the data using
type(data)
The result is
pyspark.sql.dataframe.DataFrame
I am trying to convert some columns in data to LabeledPoint in order to apply classification.
from pyspark.sql.types import *
from pyspark.sql.functions import col
from pyspark.mllib.regression import LabeledPoint

data.select(['label', 'features']) \
    .map(lambda row: LabeledPoint(row.label, row.features))
I came across this problem:
AttributeError: 'DataFrame' object has no attribute 'map'
Any idea on the error? Is there a way to generate a LabeledPoint from a DataFrame in order to perform classification?
The RDD map() transformation is used to apply complex operations such as adding a column, updating a column, or transforming the data; the output of a map transformation always has the same number of records as its input.
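A small illustration of that behaviour, using hypothetical toy data:

rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd.map(lambda x: x * 2).collect()   # [2, 4, 6] -- one output record per input record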
Use .rdd.map instead:

>>> data.select(...).rdd.map(...)

DataFrame.map has been removed in Spark 2.
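As a minimal sketch of the full conversion, assuming label holds a numeric class and the feature columns (here the hypothetical names f1 and f2) hold numeric values; since spark.read.csv with header=True reads every column as a string, the values are cast to float before building each LabeledPoint:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Build an RDD[LabeledPoint] from the DataFrame; 'f1' and 'f2' are assumed
# feature column names, and all CSV values are strings until cast.
labeled = (data.select('label', 'f1', 'f2').rdd
               .map(lambda row: LabeledPoint(float(row.label),
                                             Vectors.dense([float(row.f1), float(row.f2)]))))

labeled.take(1)

The resulting RDD can then be passed to an MLlib classifier, for example LogisticRegressionWithLBFGS.train(labeled).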