I have a list of tuples (String, String, Int, Double) that I want to convert to a Spark RDD.
In general, how do I convert a Scala Iterable[(A1, A2, ..., An)] into a Spark RDD?
Spark: create an RDD from a Seq or List (using parallelize). RDDs are generally created from a parallelized collection, i.e. by taking an existing collection in the driver program (Scala, Python, etc.) and passing it to SparkContext's parallelize() method.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
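As a quick sketch of both creation routes (the app name, master URL, and the commented-out HDFS path are illustrative assumptions, not anything from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local SparkContext for demonstration purposes
val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

// Way 1: parallelize an existing collection from the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))
val total = fromCollection.sum()  // 10.0

// Way 2: reference a dataset in external storage (hypothetical path)
// val fromFile = sc.textFile("hdfs://namenode:8020/data/input.txt")

sc.stop()
```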
A Dataset is a strongly typed DataFrame, so both Dataset and DataFrame can use .rdd to convert to an RDD.
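For example, a minimal sketch of the .rdd conversion (the Person case class and the sample rows are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Example record type (hypothetical)
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Ann", 30), Person("Bo", 25)).toDS()  // Dataset[Person]
val df = ds.toDF()                                        // DataFrame = Dataset[Row]

val rddFromDs = ds.rdd  // RDD[Person] -- keeps the element type
val rddFromDf = df.rdd  // RDD[Row]    -- untyped rows

val n = rddFromDs.count()  // 2
spark.stop()
```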
There are a few ways to do this, but the most straightforward way is just to use Spark Context:
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
sc.parallelize(YourIterable.toList)
I think sc.parallelize needs a conversion to a List, but it will preserve your structure, so you will still get an RDD[(String, String, Int, Double)].
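Putting it together for the tuple type in the question (the sample rows are invented for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("tuple-rdd").setMaster("local[*]"))

// Hypothetical sample data matching (String, String, Int, Double)
val data: List[(String, String, Int, Double)] = List(
  ("alice", "NYC", 30, 55000.0),
  ("bob", "LA", 25, 48000.0)
)

// parallelize preserves the tuple structure of the elements
val rdd: RDD[(String, String, Int, Double)] = sc.parallelize(data)

val count = rdd.count()            // 2
val first = rdd.collect().head._1  // "alice"
sc.stop()
```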