There are similar questions already asked. The most similar is this one: Spark: How to split an RDD[T] into Seq[RDD[T]] and preserve the ordering
However, I do not care about preserving the order. Furthermore, I do not have any ID columns in my data. What I care about most is that each row of the data ends up in exactly one of the new RDDs. For this reason I cannot use randomSplit, although I would welcome an equally simple solution. Traversing the partitions via the SparkContext will not work either.
I understand that splitting an RDD into several RDDs makes little sense, since RDDs are already designed to be processed in parallel across a cluster (and are thus automatically partitioned).
However, splitting an RDD is a requirement of the highly complex business logic that my Spark code has to implement, and I cannot have it any other way.
The solution I have is to select ranges of rows from the big RDD and put each range into a new RDD. However, this looks like a time-consuming task, and therefore not a good solution.
I would appreciate it if anyone could give me a hand with this, and keep the explanation at a beginner level.
SOLUTION THAT WORKED FOR ME:
val numberOfRows = 10000L
val indexedRDD = rdd.zipWithIndex.cache()   // rdd is the original RDD to split; cache so each split filter reuses the index
val numOfSplits = ((indexedRDD.count() + numberOfRows - 1) / numberOfRows).toInt   // ceiling division
val splitRDDs = for (i <- 0 until numOfSplits) yield {
  val from = i * numberOfRows
  indexedRDD
    .filter { case (_, index) => index >= from && index < from + numberOfRows }   // rows in this chunk's range
    .map { case (row, _) => row }                                                 // drop the index again
}
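A quick way to sanity-check the result (splitRDDs is the value from the snippet above; each count() triggers a separate Spark job):

splitRDDs.zipWithIndex.foreach { case (chunk, i) =>
  println(s"chunk $i holds ${chunk.count()} rows")   // every row appears in exactly one chunk
}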
Can you use data from one of the columns and filter according to that?
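A hedged sketch of that idea, with a made-up Record type and category field (not from the question), assuming rdd: RDD[Record]:

case class Record(category: String, payload: String)   // hypothetical row type

// one RDD per distinct category value; every row lands in exactly one of them
val categories = rdd.map(_.category).distinct().collect()
val perCategoryRDDs = categories.map(c => rdd.filter(_.category == c))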
You could also write a program with mapPartitionsWithIndex that takes the first n rows from each partition for the first RDD, and then uses mapPartitionsWithIndex again to take the rest for the second RDD. If you need an exact total number of rows, you would need to do some math there, but it can be done. A minimal sketch of the idea follows.
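The helper below is only a sketch of this approach; splitByPartitionPrefix and the per-partition cutoff n are placeholder names, not part of the question:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// split an RDD into "the first n rows of each partition" and "everything after"
def splitByPartitionPrefix[T: ClassTag](rdd: RDD[T], n: Int): (RDD[T], RDD[T]) = {
  val head = rdd.mapPartitionsWithIndex { (_, rows) => rows.take(n) }  // first n per partition
  val tail = rdd.mapPartitionsWithIndex { (_, rows) => rows.drop(n) }  // the remainder per partition
  (head, tail)
}

To hit an exact global row count, you could first collect the per-partition sizes (e.g. with another mapPartitionsWithIndex) and compute how many rows to take from each partition before applying the split.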