How to get a sample with an exact sample size in Spark RDD?

Why does rdd.sample() on a Spark RDD return a different number of elements each time, even though the fraction parameter stays the same? For example, with the code below:

val a = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1).count

Every time I run the second line of the code, it returns a different number that is not equal to 1000. I expected to see 1000 every time, even though the 1000 elements themselves might differ. Can anyone tell me how I can get a sample whose size is exactly 1000? Thank you very much.

asked Sep 29 '15 06:09

Carter

1 Answer

If you want a sample of an exact size, use takeSample:

a.takeSample(false, 1000)

Note, however, that this returns an Array collected to the driver, not an RDD.
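
If an RDD is needed afterwards, one option is to redistribute the array that takeSample returns. The following is only a minimal sketch, assuming the sample fits comfortably in driver memory and that a local Spark installation is available; the names ExactSampleDemo and exactSample are hypothetical and not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object ExactSampleDemo {
  // Hypothetical helper: draw n elements without replacement on the driver,
  // then turn the resulting Array back into an RDD.
  // Assumption: n elements fit in driver memory.
  def exactSample[T: ClassTag](rdd: RDD[T], n: Int, seed: Long): RDD[T] = {
    val sampled: Array[T] = rdd.takeSample(withReplacement = false, num = n, seed = seed)
    rdd.sparkContext.parallelize(sampled.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative choices.
    val conf = new SparkConf().setMaster("local[2]").setAppName("exact-sample-demo")
    val sc = new SparkContext(conf)
    try {
      val a = sc.parallelize(1 to 10000, 3)
      val s = exactSample(a, 1000, seed = 42L)
      println(s.count()) // size of the sampled RDD
    } finally {
      sc.stop()
    }
  }
}
```

If the RDD holds fewer than n elements, takeSample without replacement returns all of them, so the result can be smaller than n in that case.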

As for why a.sample(false, 0.1) doesn't return the same sample size every time: Spark internally uses Bernoulli sampling to take the sample. The fraction argument doesn't represent a fraction of the actual size of the RDD. It represents the probability of each element in the population being selected for the sample, and as Wikipedia says:

Because each element of the population is considered separately for the sample, the sample size is not fixed but rather follows a binomial distribution.

And that essentially means that the number doesn't remain fixed.
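
The effect can be reproduced without Spark. The sketch below is a plain-Scala simulation of the idea, not Spark's actual implementation: each element is kept independently with probability fraction, so for 10000 elements and fraction 0.1 the sample size follows Binomial(10000, 0.1), with mean 1000 and standard deviation sqrt(10000 * 0.1 * 0.9) = 30. The names BernoulliSampleDemo, bernoulliSample and sampleSizes are made up for this illustration.

```scala
import scala.util.Random

object BernoulliSampleDemo {
  // Keep each element independently with probability `fraction`.
  def bernoulliSample[T](data: Seq[T], fraction: Double, rng: Random): Seq[T] =
    data.filter(_ => rng.nextDouble() < fraction)

  // Sizes of the samples obtained from `trials` independent runs over 1..n.
  def sampleSizes(n: Int, fraction: Double, trials: Int, seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    val data = 1 to n
    Seq.fill(trials)(bernoulliSample(data, fraction, rng).size)
  }

  def main(args: Array[String]): Unit = {
    val sizes = sampleSizes(n = 10000, fraction = 0.1, trials = 5, seed = 1L)
    // The sizes cluster around 1000 but generally differ from run to run.
    println(sizes.mkString(", "))
  }
}
```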

If you set the first argument (withReplacement) to true, Spark uses Poisson sampling instead, which also yields a sample size that varies from run to run.

Update

If you want to stick with the sample method, you can specify a larger probability for the fraction parameter and then call take, as in:

a.sample(false, 0.2).take(1000)

This should usually, though not always, yield a sample of exactly 1000 elements. It works when the population is large enough that the oversampled RDD almost certainly contains at least 1000 elements.
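
To get a feel for how safe the oversampling margin is, the binomial mean and standard deviation can be worked out directly. This is a rough back-of-the-envelope sketch in plain Scala, assuming the sampled size follows Binomial(n, fraction); OversampleMargin and its methods are hypothetical names used only here.

```scala
object OversampleMargin {
  // Mean and standard deviation of a Binomial(n, p) sample size.
  def binomialMeanSd(n: Long, p: Double): (Double, Double) =
    (n * p, math.sqrt(n * p * (1 - p)))

  // Number of standard deviations by which `target` lies below the expected size.
  def sdsBelowMean(n: Long, p: Double, target: Int): Double = {
    val (mean, sd) = binomialMeanSd(n, p)
    (mean - target) / sd
  }

  def main(args: Array[String]): Unit = {
    val (mean, sd) = binomialMeanSd(10000L, 0.2)
    val margin = sdsBelowMean(10000L, 0.2, 1000)
    println(f"expected size = $mean%.1f, sd = $sd%.1f, margin = $margin%.1f sd")
  }
}
```

For 10000 elements and fraction 0.2 the expected size is about 2000 with a standard deviation of about 40, so falling below 1000 is extremely unlikely; with a smaller population or a fraction closer to the target the margin shrinks. Also keep in mind that take returns an Array holding the first 1000 elements of the sampled RDD in partition order, so elements from later partitions may be under-represented.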

answered Sep 18 '22 17:09

Bhashit Parikh
