There are similar questions already asked. The most similar is this one: Spark: How to split an RDD[T] into Seq[RDD[T]] and preserve the ordering
However, I do not care about preserving the order. Furthermore, I do not have any ID columns in my data. What I care about most is that each row of the data ends up in exactly one of the new RDDs. For this reason I cannot use randomSplit, although I would welcome an equally simple solution. Traversing the partitions via the SparkContext will not work either.
I understand that splitting an RDD into several RDDs makes little sense, since RDDs are already designed to be processed in parallel across a cluster (and are thus automatically partitioned).
However, splitting an RDD is a requirement of the highly complex business logic that my Spark code has to implement, and I cannot have it any other way.
The solution I have is to select ranges of rows from the big RDD and put each range into a new RDD. However, this looks like a time-consuming task, and therefore not a good solution.
I would appreciate it if anyone could give me a hand with this, and keep the explanation at a beginner level.
SOLUTION THAT WORKED FOR ME:
val numberOfRows = 10000L
val indexedRDD = rdd.zipWithIndex.cache()   // rdd is the original RDD to split; cache so each split filter reuses the index
val numOfSplits = ((indexedRDD.count() + numberOfRows - 1) / numberOfRows).toInt   // ceiling division
val splitRDDs = for (i <- 0 until numOfSplits) yield {
  val from = i * numberOfRows
  indexedRDD
    .filter { case (_, index) => index >= from && index < from + numberOfRows }   // rows in this chunk's range
    .map { case (row, _) => row }                                                 // drop the index again
}
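A quick way to sanity-check the result (splitRDDs is the value from the snippet above; each count() triggers a separate Spark job):

splitRDDs.zipWithIndex.foreach { case (chunk, i) =>
  println(s"chunk $i holds ${chunk.count()} rows")   // every row appears in exactly one chunk
}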
Can you use data from one of the columns and filter according to that?
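A hedged sketch of that idea, with a made-up Record type and category field (not from the question), assuming rdd: RDD[Record]:

case class Record(category: String, payload: String)   // hypothetical row type

// one RDD per distinct category value; every row lands in exactly one of them
val categories = rdd.map(_.category).distinct().collect()
val perCategoryRDDs = categories.map(c => rdd.filter(_.category == c))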
You could also write a program with mapPartitionsWithIndex that takes the first n rows from each partition for the first RDD, and then uses mapPartitionsWithIndex again to take the rest for the second RDD. If you need an exact total number of rows, you would need to do some math there, but it can be done. A minimal sketch of the idea follows.
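The helper below is only a sketch of this approach; splitByPartitionPrefix and the per-partition cutoff n are placeholder names, not part of the question:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// split an RDD into "the first n rows of each partition" and "everything after"
def splitByPartitionPrefix[T: ClassTag](rdd: RDD[T], n: Int): (RDD[T], RDD[T]) = {
  val head = rdd.mapPartitionsWithIndex { (_, rows) => rows.take(n) }  // first n per partition
  val tail = rdd.mapPartitionsWithIndex { (_, rows) => rows.drop(n) }  // the remainder per partition
  (head, tail)
}

To hit an exact global row count, you could first collect the per-partition sizes (e.g. with another mapPartitionsWithIndex) and compute how many rows to take from each partition before applying the split.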