Is there any way to get the number of elements in a Spark RDD partition, given the partition ID, without scanning the entire partition?
Something like this:
Rdd.partitions().get(index).size()
Except I don't see such an API for Spark. Any ideas? Workarounds?
Thanks
A common rule of thumb for choosing the number of partitions in an RDD is to match the number of cores in the cluster, so that all partitions can be processed in parallel and the cluster's resources are used efficiently.
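As a rough illustration, here is a minimal sketch (the application name and example data are made up) of aligning an RDD's partition count with the cluster's default parallelism, which on most cluster managers corresponds to the total number of cores available to the application:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partition-sizing").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000)                      // example data, made up
val balanced = rdd.repartition(sc.defaultParallelism)    // roughly one partition per available core
println(balanced.getNumPartitions)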
Similarly, in PySpark you can get the current number of partitions by calling getNumPartitions() on the RDD; to use it with a DataFrame, first convert the DataFrame to an RDD.
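The Scala API exposes the same call; a small sketch (assuming a DataFrame named df and an RDD named rdd already exist) of reading the partition count for each:

val rddPartitions = rdd.getNumPartitions      // partition count of an RDD
val dfPartitions  = df.rdd.getNumPartitions   // for a DataFrame, go through its underlying RDD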
The following gives you a new RDD with elements that are the sizes of each partition:
rdd.mapPartitions(iter => Iterator(iter.size), preservesPartitioning = true)
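To answer the original question of looking up one partition's element count by its ID, a hedged variant using mapPartitionsWithIndex (the name sizesByPartition is just illustrative); note that Spark does not store per-partition counts, so each partition is still iterated once to count it:

val sizesByPartition: Map[Int, Int] =
  rdd.mapPartitionsWithIndex(
    (idx, iter) => Iterator((idx, iter.size)),   // (partition ID, element count)
    preservesPartitioning = true
  ).collect().toMap

println(sizesByPartition(0))                     // number of elements in partition 0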