
Random sampling in pyspark with replacement

I have a DataFrame df with 9,000 unique ids, like:

| id |
|----|
| 1  |
| 2  |

I want to draw a random sample with replacement of size 100,000 from these 9,000 ids. How do I do it in PySpark?

I tried

df.sample(True,0.5,100)

But I do not know how to get exactly 100,000 rows.

asked Jun 07 '16 by Shweta Kamble


People also ask

How do you get a random sample in PySpark?

PySpark RDD also provides a sample() function to get a random sampling, and it has another method, takeSample(), that returns an Array[T] (a local collection rather than an RDD). The RDD sample() function returns a random sampling similar to the DataFrame version and takes similar parameters, but in a different order.
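For illustration, here is a minimal sketch of both RDD methods; the SparkSession setup and the toy data are assumptions for the example, not part of the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(range(100))

    # sample(withReplacement, fraction, seed) returns another distributed RDD
    sampled_rdd = rdd.sample(True, 0.1, 42)

    # takeSample(withReplacement, num, seed) returns a local list of exactly num elements
    sampled_list = rdd.takeSample(True, 10, 42)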

What does .collect do in PySpark?

collect() is the function, or operation, on an RDD or DataFrame that is used to retrieve the data from the DataFrame. It is useful for retrieving all the elements of each row from every partition and bringing them back to the driver node/program.
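A small sketch, using a made-up three-row DataFrame and the spark session from the earlier snippet:

    small_df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

    # collect() pulls every Row back to the driver as a Python list
    rows = small_df.collect()
    for row in rows:
        print(row.id)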

How do you shuffle data in PySpark?

shuffle() is used to shuffle the values in an array column for all rows of a PySpark DataFrame. It returns a new array with the values in shuffled order. It takes the array-type column name as a parameter. Please note that it shuffles randomly.
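For example (the values column is hypothetical, and the spark session from above is assumed):

    from pyspark.sql import functions as F

    arr_df = spark.createDataFrame([([1, 2, 3, 4],)], ["values"])

    # shuffle() randomly reorders the elements inside each row's array
    arr_df.select(F.shuffle("values").alias("shuffled")).show()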

What is crossJoin in PySpark?

[From Holden's "High-Performance Spark"] Let's start with the cross join. This join simply combines each row of the first table with each row of the second table. For example, if we have m rows in one table and n rows in another, this gives us m*n rows in the resulting table.
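A quick sketch with two hypothetical tables of m = 2 and n = 3 rows:

    left = spark.createDataFrame([(1,), (2,)], ["a"])
    right = spark.createDataFrame([("x",), ("y",), ("z",)], ["b"])

    # crossJoin() pairs every row of left with every row of right: 2 * 3 = 6 rows
    left.crossJoin(right).show()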


1 Answer

Okay, so first things first. You will probably not be able to get exactly 100,000 in your (over)sample. The reason is that in order to sample efficiently, Spark uses something called Bernoulli sampling. Basically, it goes through your RDD and assigns each row a probability of being included. So if you want a 10% sample, each row individually has a 10% chance of being included; Spark does not check whether the selections add up to exactly the number you want, but the total tends to be very close for large datasets.

The code would look like this: df.sample(True, 11.11111, 100). This takes a sample roughly equal to 11.11111 times the size of the original dataset, since with replacement the fraction is the expected number of times each row is drawn. Because 11.11111 * 9,000 ~= 100,000, you will get approximately 100,000 rows.
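Spelled out as a sketch (df here stands for the 9,000-id DataFrame from the question, and the seed is arbitrary):

    fraction = 100000 / 9000.0   # ~11.11, the expected number of draws per row

    # sample(withReplacement=True, fraction, seed) gives roughly 100,000 rows
    oversampled = df.sample(True, fraction, 100)
    print(oversampled.count())   # close to 100,000, but varies from run to run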

If you want an exact sample, you have to use df.rdd.takeSample(True, 100000) (takeSample is an RDD method, so go through df.rdd). However, the result is not a distributed dataset: this call returns a local list on the driver (a very large one). If it can be created in main memory, then do that. However, because you require the exact right number of IDs, I don't know of a way to do that in a distributed fashion.
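As a sketch of that approach (again assuming df is the question's DataFrame; the seed and the rebuild into a DataFrame are optional extras):

    # takeSample is an RDD method; it returns a local list of exactly 100,000 Rows
    rows = df.rdd.takeSample(True, 100000, 100)
    print(len(rows))   # exactly 100,000

    # if the sample fits in driver memory and you need a DataFrame again:
    exact_df = spark.createDataFrame(rows, df.schema)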

answered Oct 13 '22 by Katya Willard