How to parallelize an RDD?

To read a file into memory I use:

val lines = sc.textFile("myLogFile*")

which is of type:

org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

Reading the Scala doc at http://spark.apache.org/docs/0.9.1/scala-programming-guide.html#parallelized-collections: "Parallelized collections are created by calling SparkContext’s parallelize method on an existing Scala collection (a Seq object)."

This does not seem to apply to an RDD? Can parallelized processing occur on an RDD? Do I need to convert the RDD to a Seq object?
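For reference, a minimal sketch (assuming the spark-shell's predefined sc) contrasting the two: textFile already hands back an RDD, while parallelize is what turns a local Scala collection into one. The Seq(1, 2, 3, 4) below is made up purely for illustration:

// textFile already returns a partitioned RDD[String]:
val lines = sc.textFile("myLogFile*")

// parallelize turns a *local* Scala collection into an RDD
// (the Seq here is a hypothetical example, not data from the question):
val nums = sc.parallelize(Seq(1, 2, 3, 4))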

asked Apr 25 '14 by blue-sky

1 Answer

Resilient Distributed Datasets (RDDs) are, as the name suggests, distributed, fault-tolerant, and parallel.

"RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and ma- nipulate them using a rich set of operators." Please see this paper.

No, you don't need to convert the RDD to a Seq object. All processing on RDDs is already done in parallel (the degree of parallelism depends on how your Spark installation is configured and on how the RDD is partitioned).
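As a minimal illustration of that point (spark-shell; the ERROR filter is made up), every transformation and action below runs as one task per partition, in parallel, without any Seq conversion:

val lines = sc.textFile("myLogFile*")           // already an RDD, already partitioned
val errors = lines.filter(_.contains("ERROR"))  // transformation: evaluated in parallel, per partition
val n = errors.count()                          // action: tasks run across the cluster, result on the driver
println(lines.partitions.size)                  // number of partitions = the degree of parallelism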

answered Sep 21 '22 by Soumya Simanta