
Take part of an RDD and keep it an RDD

I can't find a way to take just a part of an RDD. take seems promising, but it returns a list instead of an RDD. I can of course convert the list back to an RDD, but this seems wasteful and ugly.

 my_rdd = sc.textFile("my_file.csv")
 part_of_my_rdd = sc.parallelize(my_rdd.take(10000))

Is there a better way to do this?

Akavall asked Oct 30 '22 08:10



1 Answer

Yes, there is a better way. You can use the RDD sample method, which the documentation describes as:

sample(withReplacement, fraction, seed=None)

Return a sampled subset of this RDD.

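quantity = 10000
my_rdd = sc.textFile("my_file.csv")
# sample() takes a fraction, not a count, so the result holds roughly (not exactly) quantity rows.
# float() avoids integer division under Python 2; min() keeps the fraction at or below 1.
fraction = min(1.0, float(quantity) / my_rdd.count())
part_of_my_rdd = my_rdd.sample(False, fraction)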
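
Note that sample picks elements at random and returns only approximately the requested number. If you need exactly the first n elements and still want an RDD, one possible approach is zipWithIndex followed by a filter. This is a minimal sketch, not the only way to do it; the helper name first_n_as_rdd is hypothetical and not part of Spark:

```python
def first_n_as_rdd(rdd, n):
    """Return an RDD with the first n elements of rdd, without collecting to the driver."""
    # zipWithIndex() pairs each element with its position: (element, index)
    indexed = rdd.zipWithIndex()
    # Keep only the pairs whose index falls inside the first n positions
    kept = indexed.filter(lambda pair: pair[1] < n)
    # Drop the index again so the result has the same element type as the input
    return kept.map(lambda pair: pair[0])
```

With the variables from the question this would be called as part_of_my_rdd = first_n_as_rdd(my_rdd, 10000). Keep in mind that zipWithIndex triggers a Spark job when the RDD has more than one partition, and the filter still scans the whole RDD, so this avoids the round trip through the driver but not the full pass over the data.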
Alberto Bonsanto answered Nov 15 '22 11:11
