I can't find a way to take just a part on rdd
. take
seems promising but it returns a list
instead of rdd
. I of course can then convert it to an rdd
, but this seems wasteful and ugly.
my_rdd = sc.textFile("my_file.csv")
part_of_my_rdd = sc.parallelize(my_rdd.take(10000))
I there a better way to do this?
Yes, indeed there is a better way. You can use the sample method from RDD
s, it states:
sample(withReplacement, fraction, seed=None)
Return a sampled subset of this RDD.
quantity = 10000
my_rdd = sc.textFile("my_file.csv")
part_of_my_rdd = my_rdd.sample(False, quantity / my_rdd.count())
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With