I am learning Spark in Python and wondering whether anyone can explain the difference between the action foreach() and the transformation map()?
rdd.map() returns a new RDD, like the built-in map function in Python. However, I would like to see an example of rdd.foreach() and understand how the two differ. Thanks!
The RDD map() transformation applies a function to every element, for example to add a column, update a column, or otherwise transform the data. The output of a map transformation always has the same number of records as its input.
foreachPartition() is similar to foreach(), but it applies a function once per partition of the RDD rather than once per element. In PySpark we define a function and pass it to foreachPartition(), which calls it on every partition. Like foreach(), it is an action, not a transformation.
In Spark, foreach() is an action available on RDD, DataFrame, and Dataset that iterates over each element of the dataset. It is similar to a for loop, except that the function runs on the executors and nothing is returned to the caller.
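To make the per-element versus per-partition distinction concrete, here is a minimal plain-Python model. It does not use Spark at all: the `partitions` list and the helpers `foreach_like` and `foreach_partition_like` are hypothetical stand-ins invented for illustration, assuming only that a dataset is split into chunks and that the per-partition variant receives an iterator over one chunk.

```python
# Plain-Python model of per-element vs. per-partition iteration.
# `partitions`, `foreach_like` and `foreach_partition_like` are
# hypothetical stand-ins for illustration, not Spark APIs.

partitions = [[1, 2, 3], [4, 5], [6]]  # pretend the data is split into 3 partitions


def foreach_like(parts, func):
    """Call func once per element; return nothing (used only for side effects)."""
    for part in parts:
        for element in part:
            func(element)


def foreach_partition_like(parts, func):
    """Call func once per partition, passing an iterator over that partition."""
    for part in parts:
        func(iter(part))


# Side effect: record every element that was visited.
element_calls = []
foreach_like(partitions, element_calls.append)

# Side effect: record one sum per partition.
partition_sums = []
foreach_partition_like(partitions, lambda it: partition_sums.append(sum(it)))

print(element_calls)   # [1, 2, 3, 4, 5, 6]
print(partition_sums)  # [6, 9, 6]
```

The per-partition form tends to be useful when the function has a setup cost (such as opening a connection) that is better paid once per chunk than once per element.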
map and flatMap are similar in that both take an element (for example, a line) from the input RDD and apply a function to it. They differ in that the function passed to map returns exactly one element, while the function passed to flatMap can return a sequence of zero or more elements, and the output of flatMap is flattened.
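The difference can be sketched in plain Python without Spark. The helpers `map_like` and `flat_map_like` below are hypothetical names made up for this illustration; they only mimic the one-result-per-input versus flattened-results behaviour described above.

```python
from itertools import chain

# Hypothetical sample input: three "lines", one of them empty.
lines = ["hello world", "", "spark is fun"]


def map_like(func, items):
    """One output per input: the result of func is kept as-is."""
    return [func(x) for x in items]


def flat_map_like(func, items):
    """func returns zero or more items per input; the results are flattened."""
    return list(chain.from_iterable(func(x) for x in items))


mapped = map_like(str.split, lines)
flattened = flat_map_like(str.split, lines)

print(mapped)     # [['hello', 'world'], [], ['spark', 'is', 'fun']]
print(flattened)  # ['hello', 'world', 'spark', 'is', 'fun']
```

Note that the mapped result still has three entries, one per input line, while the flattened result has five words and the empty line contributes nothing.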
A very simple example would be rdd.foreach(print), which prints the value of each row in the RDD but does not modify the RDD in any way.
For example, this produces an RDD with the numbers 1 to 10:
>>> rdd = sc.parallelize(range(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The map call computed a new value for each row and returned it, so I get a new RDD. However, using foreach here would be useless, because foreach doesn't modify the RDD in any way and returns nothing:
>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>
Conversely, calling map with a function that returns None, like print, isn't very useful:
>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]
The print call returns None, so mapping it just gives you a bunch of None values. You didn't want those values and you didn't want to save them, so returning them is a waste. (Note that the lines with 0, 1, 2, etc. are the output of print being executed; they don't show up until you call take, since the RDD is evaluated lazily. The contents of the RDD are just a bunch of None values.)
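Python 3's built-in map is also lazy, so the same effect can be reproduced without Spark. This is only an analogy, not Spark code: `record` is a hypothetical function that, like print, has a side effect and returns None.

```python
calls = []  # records the order in which the function actually ran


def record(x):
    """Hypothetical side-effecting function that, like print, returns None."""
    calls.append(x)
    return None


lazy = map(record, range(3))  # nothing has run yet: map is lazy
before = list(calls)          # snapshot of the side effects so far

results = list(lazy)          # consuming the iterator forces evaluation
after = list(calls)

print(before)   # []
print(results)  # [None, None, None]
print(after)    # [0, 1, 2]
```

The side effects only appear once the result is consumed, and the collected values are all None, which mirrors what take does in the session above.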
More simply, call map if you care about the return value of the function. Call foreach if you don't.