
Random sampling in pyspark with replacement

I have a DataFrame df with 9,000 unique ids, like:

| id |
|----|
| 1  |
| 2  |

I want to draw a random sample with replacement of size 100,000 from these 9,000 ids. How do I do it in PySpark?

I tried

df.sample(True,0.5,100)

But I do not know how to get exactly 100,000 rows.

asked Jun 07 '16 by Shweta Kamble


People also ask

How do you get a random sample in PySpark?

PySpark RDD also provides a sample() function to get a random sampling, and it has another method, takeSample(), that returns an Array[T] (a local collection rather than an RDD). The RDD sample() function returns a random sampling similar to the DataFrame version and takes similar parameters, but in a different order.
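For illustration, here is a minimal sketch of both RDD methods; the SparkSession setup and the toy data are assumptions for the example, not part of the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(range(100))

    # sample(withReplacement, fraction, seed) returns another distributed RDD
    sampled_rdd = rdd.sample(True, 0.1, 42)

    # takeSample(withReplacement, num, seed) returns a local list of exactly num elements
    sampled_list = rdd.takeSample(True, 10, 42)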

What does .collect do in PySpark?

collect() is the function, or operation, on an RDD or DataFrame that is used to retrieve the data from the DataFrame. It is useful for retrieving all the elements of each row from every partition and bringing them back to the driver node/program.
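A small sketch, using a made-up three-row DataFrame and the spark session from the earlier snippet:

    small_df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

    # collect() pulls every Row back to the driver as a Python list
    rows = small_df.collect()
    for row in rows:
        print(row.id)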

How do you shuffle data in PySpark?

shuffle() is used to shuffle the values in an array column for all rows of a PySpark DataFrame. It returns a new array with the values in shuffled order. It takes the array-type column name as a parameter. Please note that it shuffles randomly.
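For example (the values column is hypothetical, and the spark session from above is assumed):

    from pyspark.sql import functions as F

    arr_df = spark.createDataFrame([([1, 2, 3, 4],)], ["values"])

    # shuffle() randomly reorders the elements inside each row's array
    arr_df.select(F.shuffle("values").alias("shuffled")).show()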

What is crossJoin in PySpark?

[From Holden's "High-Performance Spark"] Let's start with the cross join. This join simply combines each row of the first table with each row of the second table. For example, if we have m rows in one table and n rows in another, this gives us m*n rows in the resulting table.
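A quick sketch with two hypothetical tables of m = 2 and n = 3 rows:

    left = spark.createDataFrame([(1,), (2,)], ["a"])
    right = spark.createDataFrame([("x",), ("y",), ("z",)], ["b"])

    # crossJoin() pairs every row of left with every row of right: 2 * 3 = 6 rows
    left.crossJoin(right).show()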


1 Answer

Okay, so first things first. You will probably not be able to get exactly 100,000 in your (over)sample. The reason is that in order to sample efficiently, Spark uses something called Bernoulli sampling. Basically, it goes through your RDD and assigns each row a probability of being included. So if you want a 10% sample, each row individually has a 10% chance of being included; Spark does not check whether the selections add up to exactly the number you want, but the total tends to be very close for large datasets.

The code would look like this: df.sample(True, 11.11111, 100). This takes a sample roughly equal to 11.11111 times the size of the original dataset, since with replacement the fraction is the expected number of times each row is drawn. Because 11.11111 * 9,000 ~= 100,000, you will get approximately 100,000 rows.
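Spelled out as a sketch (df here stands for the 9,000-id DataFrame from the question, and the seed is arbitrary):

    fraction = 100000 / 9000.0   # ~11.11, the expected number of draws per row

    # sample(withReplacement=True, fraction, seed) gives roughly 100,000 rows
    oversampled = df.sample(True, fraction, 100)
    print(oversampled.count())   # close to 100,000, but varies from run to run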

If you want an exact sample, you have to use df.rdd.takeSample(True, 100000) (takeSample is an RDD method, so go through df.rdd). However, the result is not a distributed dataset: this call returns a local list on the driver (a very large one). If it can be created in main memory, then do that. However, because you require the exact right number of IDs, I don't know of a way to do that in a distributed fashion.
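As a sketch of that approach (again assuming df is the question's DataFrame; the seed and the rebuild into a DataFrame are optional extras):

    # takeSample is an RDD method; it returns a local list of exactly 100,000 Rows
    rows = df.rdd.takeSample(True, 100000, 100)
    print(len(rows))   # exactly 100,000

    # if the sample fits in driver memory and you need a DataFrame again:
    exact_df = spark.createDataFrame(rows, df.schema)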

answered Oct 13 '22 by Katya Willard