Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the number of elements in partition? [duplicate]

Is there any way to get the number of elements in a spark RDD partition, given the partition ID? Without scanning the entire partition.

Something like this:

Rdd.partitions().get(index).size()

Except I don't see such an API for spark. Any ideas? workarounds?

Thanks

like image 404
Geo Avatar asked Feb 24 '15 02:02

Geo


People also ask

How do you determine the number of partitions in a data frame?

The best way to decide on the number of partitions in an RDD is to make the number of partitions equal to the number of cores in the cluster so that all the partitions will process in parallel and the resources will be utilized in an optimal way.

Is there any way to get the current number of partitions of a DataFrame?

Similarly, in PySpark you can get the current length/size of partitions by running getNumPartitions() of RDD class, so to use with DataFrame first you need to convert to RDD.

How can I see all partitions in a set?

An expression for dn may be given in terms of Stirling's numbers. If S(n, k) is the number of onto functions from an n-element set onto a k-element set, then S(n, k)/k! gives the number of partitions of an n-element set into k nonempty subsets. Hence, by the sum rule, dn = S(n,1)/1!


1 Answers

The following gives you a new RDD with elements that are the sizes of each partition:

rdd.mapPartitions(iter => Array(iter.size).iterator, true) 
like image 147
pzecevic Avatar answered Oct 19 '22 17:10

pzecevic