To read a file into memory I use:
val lines = sc.textFile("myLogFile*")
which is of type:
org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
Reading the Scala doc at http://spark.apache.org/docs/0.9.1/scala-programming-guide.html#parallelized-collections: "Parallelized collections are created by calling SparkContext’s parallelize method on an existing Scala collection (a Seq object)."
This does not seem to apply to an RDD? Can parallelized processing occur on an RDD? Do I need to convert the RDD to a Seq object?
One way to achieve parallelism in PySpark without using Spark's own distributed data structures is Python's multiprocessing library, which provides process-based parallelism with a threading-like API. Note, however, that by default such code runs entirely on the driver node, so it does not distribute work across the cluster the way an RDD does.
parallelize() is SparkContext's method for creating a parallelized collection from an existing local collection. It lets Spark distribute the data across multiple nodes instead of depending on a single node to process it.
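For concreteness, here is a minimal sketch of parallelize() in action (assuming sc is the same SparkContext as in the question; the sample data is made up):

val data = Seq(1, 2, 3, 4, 5)              // an ordinary local Scala collection
val distData = sc.parallelize(data)        // RDD[Int], split into partitions across the cluster
val doubled = distData.map(_ * 2)          // transformation, executed in parallel per partition
println(doubled.collect().mkString(", "))  // action: gathers the results back to the driver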
Resilient Distributed Datasets (RDDs) are, as the name suggests, distributed, fault-tolerant, and parallel.
"RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and ma- nipulate them using a rich set of operators." Please see this paper.
No, you don't need to convert an RDD to a Seq object. All processing on an RDD is already done in parallel, with the degree of parallelism determined by the number of partitions and the resources available to your Spark installation.
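If you want to check or tune that degree of parallelism, a sketch like this works on the lines RDD from the question (the partition count 8 is arbitrary):

println(lines.partitions.length)                     // one task per partition is launched per stage
val wider = lines.repartition(8)                     // reshuffle into 8 partitions for more parallelism
val totalChars = wider.map(_.length).reduce(_ + _)   // runs in parallel across the partitions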