Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split an RDD into multiple (smaller) RDDs given a max number of rows per RDD, and without using an ID column

There are similar questions already asked. The most similar is this one: Spark: How to split an RDD[T]` into Seq[RDD[T]] and preserve the ordering

However, I do not care about preserving the order. Furthermore, I do not have any ID columns in data. What I care about the most is that each row of data is written into a new RDD only once! For this reason, I cannot use a randomSplit, although I am looking forward to such a simple solution. Traversing the partitioned sparkContext will not work either.

I understand that splitting an RDD into several RDDs makes no sense, since RDDs are already made to be processed across many clusters (thus automatically get split).

However, splitting an RDD is a requirement according to a highly complex business logic I need to implement the spark code with and I cannot have it any other way.

The solution that I have is to select ranges from a big RDD, and then simply put each range into a new RDD. However, this looks like a time-consuming task, and therefore not a good solution.

I would appreciate if anyone could give me a hand with this, and keep it down to a beginner level.

SOLUTION THAT WORKED FOR ME:

val numberOfRows = 10000

indexedRDD = RDD.zipWithIndex

for (FROM <-1 to numOfPartitions){
val tempRDD = indexedRDD.filter(from=> {from._2>from && from._2 < from+numberOfRows}).map(from=>from._1)
}
like image 373
3xCh1_23 Avatar asked Dec 07 '25 21:12

3xCh1_23


1 Answers

Can you use data from one of the columns and filter according to that?

You could also make a program with mapPartitionsWithIndex that would take first n rows from each partition for the first RDD and then again mapPartitionsWithIndex and take the rest for the second RDD. If you need an exact number of rows, then you would need to do some math there, but it can be done.

like image 53
pzecevic Avatar answered Dec 09 '25 20:12

pzecevic



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!