Trying to use map on a Spark DataFrame

I recently started experimenting with both Spark and Java. I first went through the famous WordCount example using an RDD and everything went as expected. Now I am trying to implement my own example, but using DataFrames instead of RDDs.

So I am reading a dataset from a file with:

// Spark 1.x: read a semicolon-delimited CSV with a header via the spark-csv package
DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("delimiter", ";")
        .option("header", "true")
        .load(inputFilePath);

and then I try to select a specific column and apply a simple transformation to every row, like this:

df = df.select("start")
        .map(text -> text + "asd");

But compilation fails on the second line with an error I don't fully understand (the start column is inferred to be of type string):

Multiple non-overriding abstract methods found in interface scala.Function1

Why is my lambda function treated as a Scala function and what does the error message actually mean?

asked Mar 02 '17 by LetsPlayYahtzee

People also ask

Can we use map on DataFrame in Spark?

Spark's map() is a transformation operation that applies a function to every element of an RDD, DataFrame, or Dataset and returns a new RDD or Dataset with the results.
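
As a rough sketch in the Spark 2.x Java API (where a DataFrame is a Dataset<Row>; df and the start column are taken from the question above), this could look as follows. The explicit MapFunction cast selects the Java-friendly overload of map(), and the Encoder tells Spark how to serialize the resulting strings:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Append a suffix to every value of the "start" column.
Dataset<String> result = df.select("start")
        .map((MapFunction<Row, String>) row -> row.getString(0) + "asd",
                Encoders.STRING());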

How does map function work in Spark?

Spark's map function takes one element as input, processes it according to custom code (specified by the developer), and returns one element at a time. map transforms an RDD of length N into another RDD of length N; the input and output RDDs have the same number of records.
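
A minimal sketch of that one-in, one-out behaviour with the Java RDD API (assuming sc is an existing JavaSparkContext):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// Three input records produce exactly three output records.
JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "bb", "ccc"));
JavaRDD<Integer> lengths = words.map(String::length); // [1, 2, 3]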

How do I create a map in Spark?

Using Spark DataTypes: we can create a map column using the createMapType() function on the DataTypes class. This method takes two arguments, keyType and valueType, and both must be types that extend DataType.
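
For example, a small sketch (the field names name and scores are illustrative only):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// A map column whose keys are strings and whose values are integers.
MapType scoresType = DataTypes.createMapType(DataTypes.StringType, DataTypes.IntegerType);

// The map type can then be used inside a schema definition.
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("name", DataTypes.StringType, false),
        DataTypes.createStructField("scores", scoresType, true)
});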


1 Answer

If you use the select function on a DataFrame you get a DataFrame back, and your function is then applied to the Row datatype, not to the value inside the row, so you need to extract the value first. On top of that, in the Java API you have to go through javaRDD(), because DataFrame.map expects a scala.Function1, which a Java lambda cannot implement (that is what the "multiple non-overriding abstract methods" error is telling you):

df.select("start").map(el->el.getString(0)+"asd")

Note, though, that you will get an RDD back as the return value, not a DataFrame.
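
For completeness, a rough end-to-end sketch under the same assumptions as the question (Spark 1.x Java API with the spark-csv package and a string column named start):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;

// Read the CSV exactly as in the question.
DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("delimiter", ";")
        .option("header", "true")
        .load(inputFilePath);

// select() keeps the Row wrapper; javaRDD() switches to the Java-friendly
// RDD API, whose map() accepts plain Java lambdas.
JavaRDD<String> result = df.select("start")
        .javaRDD()
        .map(row -> row.getString(0) + "asd");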

answered Sep 21 '22 by jojo_Berlin