How to parallelize an RDD?

To read a file into memory I use:

val lines = sc.textFile("myLogFile*")

which is of type:

org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

Reading the Scala doc at http://spark.apache.org/docs/0.9.1/scala-programming-guide.html#parallelized-collections: "Parallelized collections are created by calling SparkContext’s parallelize method on an existing Scala collection (a Seq object)."

This does not seem to apply to an RDD? Can parallelized processing occur on an RDD? Do I need to convert the RDD to a Seq object?
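For reference, a minimal sketch (assuming the spark-shell's predefined sc) contrasting the two: textFile already hands back an RDD, while parallelize is what turns a local Scala collection into one. The Seq(1, 2, 3, 4) below is made up purely for illustration:

// textFile already returns a partitioned RDD[String]:
val lines = sc.textFile("myLogFile*")

// parallelize turns a *local* Scala collection into an RDD
// (the Seq here is a hypothetical example, not data from the question):
val nums = sc.parallelize(Seq(1, 2, 3, 4))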

asked Apr 25 '14 by blue-sky

1 Answer

Resilient Distributed Datasets (RDDs) are, as the name suggests, distributed, fault-tolerant, and parallel.

"RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and ma- nipulate them using a rich set of operators." Please see this paper.

No, you don't need to convert the RDD to a Seq object. All processing on RDDs is already done in parallel (the degree of parallelism depends on how your Spark installation is configured and on how the RDD is partitioned).
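As a minimal illustration of that point (spark-shell; the ERROR filter is made up), every transformation and action below runs as one task per partition, in parallel, without any Seq conversion:

val lines = sc.textFile("myLogFile*")           // already an RDD, already partitioned
val errors = lines.filter(_.contains("ERROR"))  // transformation: evaluated in parallel, per partition
val n = errors.count()                          // action: tasks run across the cluster, result on the driver
println(lines.partitions.size)                  // number of partitions = the degree of parallelism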

answered Sep 21 '22 by Soumya Simanta