 

How to partition a single RDD into multiple RDD in spark [duplicate]

I have an RDD in which each entry belongs to a class. I want to split the single RDD into several RDDs, so that all entries of a class go into one RDD. If the input RDD has 100 such classes, I want each class in its own RDD. I can do this with a filter per class (as shown below), but that launches several jobs. Is there a better way to do it in a single job?

def method(input: RDD[LabeledPoint], classes: List[Double]): List[RDD[LabeledPoint]] =
      classes.map { lbl => input.filter(_.label == lbl) }

It's similar to another question, but I have more than 2 classes (around 10).

Arun asked Feb 02 '26 19:02
1 Answer

I was facing the same issue and, unfortunately, according to the various resources I found, there is no other way.

The thing is that to build your result you need to go from the RDD to an actual list, and if you look here, the answer also says it's not possible: a single Spark transformation cannot produce multiple RDDs.

What you are doing should be fine, and if you want to optimize, cache the input RDD so that each per-class filter reads it from memory instead of recomputing it.
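As a minimal sketch of that filter-per-class pattern, here it is on a plain Scala List standing in for the RDD, so it runs without a Spark cluster (the `LabeledPoint` case class and `splitByClass` name are illustrative stand-ins, not Spark API). With a real RDD you would call `input.cache()` once before mapping over the classes, so each `filter` scans the cached data rather than re-reading the source.

```scala
// Stand-in for org.apache.spark.mllib.regression.LabeledPoint.
case class LabeledPoint(label: Double, features: Array[Double])

// Same shape as the question's method: one filter pass per class label.
// On an RDD, each filter is lazy and triggers its own job when acted on,
// which is why caching the input first pays off.
def splitByClass(input: List[LabeledPoint],
                 classes: List[Double]): List[List[LabeledPoint]] =
  classes.map(lbl => input.filter(_.label == lbl))

val data = List(
  LabeledPoint(0.0, Array(1.0)),
  LabeledPoint(1.0, Array(2.0)),
  LabeledPoint(0.0, Array(3.0))
)

// One sub-collection per class, in the order the classes were given.
val parts = splitByClass(data, List(0.0, 1.0))
```

Usage note: with Spark the only change is the types (`RDD[LabeledPoint]` in, `List[RDD[LabeledPoint]]` out) plus `input.cache()` before the `map`, since each resulting RDD re-traverses the input when materialized.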

Ivan Nikolov answered Feb 05 '26 07:02

