Modify collection inside a Spark RDD foreach

I'm trying to add elements to a map while iterating over the elements of an RDD. I'm not getting any errors, but the modifications never take effect.

It all works fine when adding elements directly or iterating over other collections:

scala> val myMap = new collection.mutable.HashMap[String,String]
myMap: scala.collection.mutable.HashMap[String,String] = Map()

scala> myMap("test1")="test1"

scala> myMap
res44: scala.collection.mutable.HashMap[String,String] = Map(test1 -> test1)

scala> List("test2", "test3").foreach(w => myMap(w) = w)

scala> myMap
res46: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test1 -> test1, test3 -> test3)

But when I try to do the same from an RDD:

scala> val fromFile = sc.textFile("tests.txt")
...
scala> fromFile.take(3)
...
res48: Array[String] = Array(test4, test5, test6)

scala> fromFile.foreach(w => myMap(w) = w)
scala> myMap
res50: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test1 -> test1, test3 -> test3)

I've tried printing the map's pre-existing contents from inside the foreach to make sure it's the same variable, and it prints correctly:

scala> fromFile.foreach(w => println(myMap("test1")))
...
test1
test1
test1
...

I've also printed the modified element from inside the foreach, and it prints as modified; but once the operation completes, the map appears unchanged.

scala> fromFile.foreach({w => myMap(w) = w; println(myMap(w))})
...
test4
test5
test6
...
scala> myMap
res55: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test1 -> test1, test3 -> test3)

Converting the RDD to a local array with collect also works fine:

scala> fromFile.collect.foreach(w => myMap(w) = w)
scala> myMap
res89: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test5 -> test5, test1 -> test1, test4 -> test4, test6 -> test6, test3 -> test3)

Is this a context problem? Am I accessing a copy of the data that is being modified somewhere else?

asked Apr 30 '14 by palako

People also ask

Is foreach an action in Spark?

foreach() is an action. It does not return a value; it executes the input function on each element of the RDD.
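For instance, a minimal sketch (assuming a SparkContext named sc, as in the question):

// foreach is an action: it returns Unit and runs the function on the
// executors purely for its side effects.
val nums = sc.parallelize(Seq(1, 2, 3))
nums.foreach(n => println(n))  // on a cluster, output goes to executor logs, not the driver console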

Can data in RDD be changed once RDD is created?

No. A transformation takes an RDD as input and produces one or more new RDDs as output; every transformation creates a new RDD. The input RDDs cannot be changed, since RDDs are immutable.
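A small illustration of this (again assuming a SparkContext named sc):

// map is a transformation: it returns a new RDD and leaves its input untouched.
val original = sc.parallelize(Seq("a", "b", "c"))
val upper = original.map(_.toUpperCase)  // new RDD; `original` is unchanged
upper.collect()  // Array(A, B, C)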

What is foreach in Pyspark?

In PySpark, foreach is an action available on RDDs and DataFrames. It iterates over every element of the dataset and applies the given function to each one for its side effects.

Is reduce an action in RDD?

Yes. reduce is an action that aggregates all the elements of the RDD using a function and returns the final result to the driver program (there is also reduceByKey, which instead returns a distributed dataset).
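A quick sketch of the difference (assuming a SparkContext named sc):

// reduce is an action: it aggregates across all partitions and returns
// a single value to the driver.
val sum = sc.parallelize(1 to 100).reduce(_ + _)  // 5050

// reduceByKey is a transformation on pair RDDs: it returns a new,
// still-distributed RDD instead of a value on the driver.
val counts = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).reduceByKey(_ + _)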


1 Answer

It becomes clearer when running on a Spark cluster (not a single machine). The RDD is spread over several machines. When you call foreach, you tell each machine what to do with the piece of the RDD that it has. If the closure refers to any local variables (like myMap), they get serialized and sent to those machines so they can use them. But nothing comes back, so your original copy of myMap on the driver is unaffected.

I think this answers your question, but obviously you are trying to accomplish something and you will not be able to get there this way. Feel free to explain here or in a separate question what you are trying to do, and I will try to help.
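If the goal is simply to end up with the map on the driver, one possible sketch (using the fromFile RDD from the question; collectAsMap is an action on pair RDDs that returns a driver-local map):

// Build the key/value pairs on the executors, then bring them back to
// the driver with an action.
val localMap = fromFile.map(w => (w, w)).collectAsMap()

This is essentially what your working collect-based version does, just expressed as a pair RDD.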

answered Oct 01 '22 by Daniel Darabos