Converting a Scala Iterable[tuple] to RDD

I have a list of tuples of type (String, String, Int, Double) that I want to convert to a Spark RDD.

In general, how do I convert a Scala Iterable[(a1, a2, a3, ..., an)] into a Spark RDD?

asked Oct 22 '15 by oikonomiyaki

People also ask

Which method can be used in Spark to convert a Scala collection into an RDD?

RDDs are generally created from a parallelized collection, i.e., by taking an existing collection in the driver program (Scala, Python, etc.) and passing it to SparkContext's parallelize() method.
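For example, a minimal sketch (the app name and sample data below are placeholders, not from the original question):

import org.apache.spark.{SparkConf, SparkContext}

// Build a local SparkContext (names here are hypothetical)
val conf = new SparkConf().setAppName("parallelize-example").setMaster("local[*]")
val sc = new SparkContext(conf)

// Distribute a local Scala collection across the cluster
val rdd = sc.parallelize(Seq(("a", "x", 1, 1.0), ("b", "y", 2, 2.0)))
// rdd: RDD[(String, String, Int, Double)]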

How to create RDDs?

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
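As a sketch of both ways, assuming an existing SparkContext named sc (the HDFS path is hypothetical):

// 1. Parallelize a collection that already lives in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))

// 2. Reference a dataset in external storage (here a made-up HDFS path)
val fromStorage = sc.textFile("hdfs:///path/to/data.txt")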

How many RDDs can cogroup() work on at once?

Additionally, cogroup() can work on three or more RDDs at once.
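For instance, a sketch of a three-way cogroup, assuming an existing SparkContext named sc and made-up key-value data:

val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
val b = sc.parallelize(Seq((1, "b1"), (3, "b3")))
val c = sc.parallelize(Seq((1, "c1"), (2, "c2")))

// Groups the values from all three RDDs by key:
// RDD[(Int, (Iterable[String], Iterable[String], Iterable[String]))]
val grouped = a.cogroup(b, c)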

Can we convert a Dataset to an RDD?

A Dataset is a strongly typed DataFrame, so both Dataset and DataFrame can use .rdd to convert to an RDD.
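A minimal sketch, assuming an existing SparkSession named spark (the Record case class is hypothetical):

import spark.implicits._

case class Record(name: String, value: Int)

val ds = Seq(Record("a", 1), Record("b", 2)).toDS()
val rddFromDs = ds.rdd   // RDD[Record]

val df = ds.toDF()
val rddFromDf = df.rdd   // RDD[org.apache.spark.sql.Row]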


1 Answer

There are a few ways to do this, but the most straightforward is to use the SparkContext:

import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._

// `sc` is an existing SparkContext; parallelize distributes the local collection
sc.parallelize(YourIterable.toList)

I think sc.parallelize needs the Iterable converted to a List first, but it will preserve your structure, so you will still get an RDD[(String, String, Int, Double)].
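Applied to the question's concrete tuple type, a sketch (the sample values are made up):

val tuples: Iterable[(String, String, Int, Double)] =
  List(("k1", "v1", 1, 1.5), ("k2", "v2", 2, 2.5))

// parallelize expects a Seq, hence the toList conversion
val rdd = sc.parallelize(tuples.toList)
// rdd: RDD[(String, String, Int, Double)]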

answered Oct 03 '22 by GameOfThrows