Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

I can't understand 'RDD.map{ case (A, B) => A } ' in Scala Spark

I'm relatively new to Scala Spark. I've got a question with map method.

My understanding: map is a RDD method, it accepts a function as its parameter like:map(line => line.split(","))

I found really hard to understand this kind of expression.

val uniqueUsers = data.map { case (user, product, price) => user }.distinct().count()

Could someone explain two things for me:

  1. why {} is used not ()
  2. can I regard case (user, product, price) => user as a function? If not, what is it?

Thank you in advance.

like image 480
Phil Avatar asked May 28 '17 03:05

Phil


People also ask

What does RDD map do?

map :It returns a new RDD by applying a function to each element of the RDD. Function in map can return only one item.

How does map function work in Spark?

Spark Map function takes one element as input process it according to custom code (specified by the developer) and returns one element at a time. Map transforms an RDD of length N into another RDD of length N. The input and output RDDs will typically have the same number of records.

What is map in Spark scala?

Spark map() is a transformation operation that is used to apply the transformation on every element of RDD, DataFrame, and Dataset and finally returns a new RDD/Dataset respectively. In this article, you will learn the syntax and usage of the map() transformation with an RDD & DataFrame example.

What is RDD in Spark with example?

RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster that can be operated in parallel with a low-level API that offers transformations and actions.


1 Answers

In Scala the syntax { case arg => body } is a Partial Function.

definition of Partial Function from Scala-Doc

A partial function of type PartialFunction[A, B] is a unary function where the domain does not necessarily include all values of type A. The function isDefinedAt allows to test dynamically if a value is in the domain of the function.

In your case of { case (user, product, price) => user } you have defined a Partial Function that a takes a Tuple3 object as input. This Tuple3 object is unpacked as 3 variables of user, product and price and the function body just returns the user.

so to answer your questions

  1. why {} is used not ()

    because partial functions have to be wrapped by curly braces.

  2. an I regard case (user, product, price) => user as a function? If not, what is it?

    yes. { case (user, product, price) => user } is a special type of Function called PartialFunction which is defined only for specific inputs and is not defined for other inputs. In your case the PartialFunction is defined only for input of Tuple3[T1,T2,T3] where T1,T2, and T3 are types of user,product and price objects

like image 184
rogue-one Avatar answered Sep 20 '22 06:09

rogue-one