I've got some unbalanced data in my LabeledPoint. What I want to do is select all positives and n times as many negatives (randomly). For example, if I have 100 positives and 30000 negatives, I want to create a new LabeledPoint with all 100 positives and 300 negatives (n=3).
In a real scenario I don't know how many positives and negatives I have at the beginning.
Presumably your data is an RDD[LabeledPoint]. You can do something like the following:
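val pos = rdd.filter(_.label == 1)
val numPos = pos.count()
// takeSample returns a local Array (not an RDD) and expects an Int sample size
val neg = rdd.filter(_.label == 0).takeSample(false, (numPos * 3).toInt)
// turn the sampled array back into an RDD so it can be unioned with pos
val undersample = pos.union(rdd.sparkContext.parallelize(neg))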
You can find the docs for takeSample, filter, and union in the Spark RDD API documentation.
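If the negative sample is too large to collect on the driver, a variant using sample instead of takeSample may help. This is only a sketch: the helper name undersample, its parameters, and the default seed are hypothetical and not part of Spark, and it assumes labels are exactly 1.0 and 0.0. Because sample works with a fraction, the number of negatives is approximately n times the positives rather than exact.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical helper: keeps every positive and roughly n times as many negatives.
def undersample(rdd: RDD[LabeledPoint], n: Int, seed: Long = 42L): RDD[LabeledPoint] = {
  val pos = rdd.filter(_.label == 1.0)
  val neg = rdd.filter(_.label == 0.0)
  val numPos = pos.count()
  val numNeg = neg.count()
  // Fraction of negatives to keep, capped at 1.0 when there are too few negatives
  val fraction =
    if (numNeg == 0) 0.0 else math.min(1.0, n.toDouble * numPos / numNeg)
  // sample() stays distributed; the resulting size is approximate, not exact
  pos.union(neg.sample(withReplacement = false, fraction = fraction, seed = seed))
}
```

With 100 positives, 30000 negatives, and n = 3, the fraction works out to 0.01, so the result holds all 100 positives and around 300 negatives. If you need exactly 300, takeSample as shown above is the simpler choice.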