I would like to perform some transformations only on a subset of an RDD (to make experimenting in the REPL faster).
Is it possible?
RDD has a take(num: Int): Array[T] method; I think I'd need something similar, but returning RDD[T].
You can use RDD.sample to get an RDD out, not an Array. For example, to sample ~1% without replacement:
val data = ...
data.count
...
res1: Long = 18066983
val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
...
res3: Long = 180190
The third parameter is a seed, and is thankfully optional in the next Spark version.
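If you want the sample to come out at roughly a fixed number of elements rather than a fixed fraction, one common approach is to derive the fraction from a count. A sketch for spark-shell, assuming data is the RDD from above and targetSize is a hypothetical target you choose; note that sample draws each element independently, so the result is only approximately targetSize:

val targetSize = 1000
val fraction = math.min(1.0, targetSize.toDouble / data.count())
// Seed omitted here, which works in Spark versions where it is optional.
val approxSample = data.sample(false, fraction)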
RDDs are distributed collections which are materialized on actions only. It is not possible to truncate your RDD to a fixed size and still get an RDD back (hence RDD.take(n) returns an Array[T], just like collect).
If you want to get similarly sized RDDs regardless of the input size, you can truncate the items in each of your partitions; this way you can better control the absolute number of items in the resulting RDD. The size of the resulting RDD will depend on the Spark parallelism.
An example from spark-shell
:
import org.apache.spark.rdd.RDD
val numberOfPartitions = 1000
val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
val millionRddTruncated: RDD[Int] = millionRdd.mapPartitions(_.take(10))
val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))
millionRdd.count // 1000000
millionRddTruncated.count // 10000 = 10 items * 1000 partitions
billionRddTruncated.count // 10000 = 10 items * 1000 partitions
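Note that _.take(10) yields at most 10 items per partition, so partitions holding fewer than 10 elements contribute fewer. A sketch for spark-shell illustrating this edge case:

val smallRdd = sc.parallelize(1 to 500, 1000) // at most 1 item per partition
smallRdd.mapPartitions(_.take(10)).count      // 500, not 10000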