I am learning Spark in Python and wondering whether anyone can explain the difference between the action <code>foreach()</code> and the transformation <code>map()</code>. <code>rdd.map()</code> returns a new RDD, like the original map function in Python. However, I would like to see an example of <code>rdd.foreach()</code> and understand how the two differ. Thanks!


Difference between RDD.foreach() and RDD.map()

1 Answer

A very simple example would be rdd.foreach(print), which prints the value of each row in the RDD but does not modify the RDD in any way.

For example, this produces an RDD with the numbers 1 through 10:

>>> rdd = sc.parallelize(range(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

The map call computed a new value for each row and returned it, so I get a new RDD. Using foreach here would be pointless, because foreach discards the function's return values and itself returns None rather than an RDD:

>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>

Conversely, calling map with a function that returns None, such as print, isn't very useful:

>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]

The print call returns None, so mapping it just gives you a bunch of None values. You didn't want those values or need to save them, so returning them is a waste. (Note that the lines with 0, 1, 2, etc. are the output of print being executed; they don't show up until you call take, because the RDD is evaluated lazily. The contents of the RDD, however, are just a bunch of None values.)

More simply, call map if you care about the return value of the function. Call foreach if you don't.
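
The same split has a rough parallel in plain Python 3, which may help if you already know the built-in map. The sketch below is only an analogy and does not use Spark: the built-in map stands in for rdd.map(), consuming the iterator stands in for an action such as take, and a for loop stands in for rdd.foreach(). The names add_one, calls, and seen are made up for illustration.

```python
# Plain-Python analogy only: no Spark is involved here.
# builtin map()  ~ rdd.map()      (lazy, produces new values)
# for loop       ~ rdd.foreach()  (runs for side effects, keeps nothing)

calls = []  # records each time add_one actually runs


def add_one(x):
    calls.append(x)  # side effect so we can observe when the call happens
    return x + 1


numbers = range(0, 10)

# Like a transformation: map() only builds a lazy iterator.
mapped = map(add_one, numbers)
calls_before = len(calls)  # add_one has not run yet

# Consuming the iterator plays the role of an action such as take(10).
result = list(mapped)
calls_after = len(calls)  # add_one has now run once per element

# Like foreach: call something per element and discard any return value.
seen = []
for x in numbers:
    seen.append(x)

print(calls_before, calls_after)
print(result)
print(seen)
```

The analogy is loose: Spark distributes the work across partitions and executors, which plain Python does not, but the "nothing runs until something consumes it" behaviour is the same idea.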

answered Oct 11 '22 03:10

Oliver Dain

Tags:

apache-spark

pyspark

Chenxi Zeng
