
Difference between RDD.foreach() and RDD.map()

I am learning Spark in Python. Can anyone explain the difference between the action foreach() and the transformation map()?

rdd.map() returns a new RDD, much like the built-in map function in Python. However, I would like to see an example of rdd.foreach() and understand how the two differ. Thanks!

Chenxi Zeng Avatar asked Dec 29 '16 23:12





1 Answer

A very simple example would be rdd.foreach(print), which prints the value of each element in the RDD but does not modify the RDD in any way.

For example, this produces an RDD with the numbers 1 - 10:

>>> rdd = sc.parallelize(range(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

The map call computed a new value for each element and returned it, so I get a new RDD. If I had used foreach instead, the work would be useless: foreach discards whatever the function returns, doesn't modify the RDD in any way, and itself returns None rather than an RDD:

>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>

Conversely, calling map with a function that returns None, like print, isn't very useful:

>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]

The print call returns None, so mapping it just gives you an RDD full of None values. You didn't want those values and have no reason to save them, so returning them is a waste. (Note that the lines with 0, 1, 2, etc. are the output of print being executed; they don't show up until you call take, because the RDD is evaluated lazily. The contents of the RDD, however, are just a bunch of None values.)
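To make the lazy-versus-eager behaviour concrete without needing a Spark installation, here is a minimal pure-Python sketch. ToyRDD and its methods are hypothetical names invented for this illustration; this is a single-machine toy that only mimics the semantics described above, not how Spark implements RDDs.

```python
import itertools


class ToyRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark's code)."""

    def __init__(self, source):
        # source is a zero-argument callable that returns a fresh iterator.
        self._source = source

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items))

    def map(self, func):
        # Transformation: only records the work and returns a new ToyRDD.
        parent = self._source
        return ToyRDD(lambda: (func(x) for x in parent()))

    def foreach(self, func):
        # Action: runs func on every element immediately, discards the
        # return values, and returns None.
        for x in self._source():
            func(x)

    def take(self, n):
        # Action: evaluates just enough to return the first n elements.
        return list(itertools.islice(self._source(), n))


calls = []


def record(x):
    # A function used only for its side effect; it implicitly returns None.
    calls.append(x)


mapped = ToyRDD.parallelize(range(3)).map(record)
ran_before_action = list(calls)   # record has not been called yet

taken = mapped.take(3)            # the action triggers the mapped function
calls_after_take = list(calls)

calls.clear()
returned = ToyRDD.parallelize(range(3)).foreach(record)
calls_after_foreach = list(calls)
```

In this toy, map does nothing until take is called, and what take returns is a list of None values, while foreach runs record right away and hands back None itself.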

More simply, call map if you care about the return value of the function. Call foreach if you don't.

Oliver Dain Avatar answered Oct 11 '22 03:10
