Take part of an RDD and keep it an RDD

Question

I can't find a way to take just a part of an RDD. take seems promising, but it returns a list instead of an RDD. I can of course convert the list back to an RDD, but this seems wasteful and ugly.

 my_rdd = sc.textFile("my_file.csv")
 part_of_my_rdd = sc.parallelize(my_rdd.take(10000))

Is there a better way to do this?

Alberto Bonsanto · Accepted Answer

Yes, there is a better way. You can use the RDD sample method, which returns an RDD rather than a list. Its documentation states:

sample(withReplacement, fraction, seed=None)

Return a sampled subset of this RDD.

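quantity = 10000
my_rdd = sc.textFile("my_file.csv")
# fraction must be a float in [0, 1]; float() avoids integer division on Python 2
fraction = min(1.0, float(quantity) / my_rdd.count())
part_of_my_rdd = my_rdd.sample(False, fraction)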
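
If you need exactly the first N elements rather than a random sample, one possible alternative (not part of the answer above) is to index the elements with zipWithIndex and filter on the index. The sketch below is only an illustration: the helper names sample_fraction and first_n_as_rdd are hypothetical, and it assumes an existing SparkContext sc as in the question.

```python
def sample_fraction(quantity, total):
    """Fraction to pass to RDD.sample so that roughly `quantity` of `total` rows are kept."""
    if total <= 0:
        return 0.0
    # float() guards against integer division on Python 2; clamp to 1.0
    return min(1.0, float(quantity) / total)


def first_n_as_rdd(rdd, n):
    """Keep the first n elements (in RDD order) without collecting them to the driver."""
    return (rdd.zipWithIndex()                      # (element, index) pairs
               .filter(lambda pair: pair[1] < n)    # keep indices 0..n-1
               .map(lambda pair: pair[0]))          # drop the index again


# Usage, assuming an existing SparkContext `sc` as in the question:
# my_rdd = sc.textFile("my_file.csv")
# approx = my_rdd.sample(False, sample_fraction(10000, my_rdd.count()))
# exact = first_n_as_rdd(my_rdd, 10000)
```

Keep in mind that sample does not guarantee an exact count, so the first variant returns about 10000 rows, while the zipWithIndex variant returns exactly the first 10000 at the cost of an extra pass over the data.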

Tags:

apache-spark

pyspark

Akavall
