
Spark: Is the sample method on DataFrames uniform sampling?

I want to randomly choose a set number of rows from a DataFrame. I know the sample method does this, but I need the sampling to be uniform. Is Spark's sample method on DataFrames uniform or not?

Thanks

Zahra I.S asked Jul 26 '15 02:07




1 Answer

There are a few code paths here:

  • If withReplacement = false && fraction > .4, it draws from a souped-up random number generator and keeps a row when rng.nextDouble() <= fraction. Every row is tested independently against the same threshold, so this seems like it would be pretty uniform.
  • If withReplacement = false && fraction <= .4, it uses a more complex algorithm (GapSamplingIterator) that skips ahead between selected rows instead of testing every row. At a glance, it looks like it should be uniform as well.
  • If withReplacement = true, it does close to the same thing, except that by the looks of it the same row can be returned more than once, so this looks to me like it would not be as uniform as the first two.
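To make the first two code paths concrete, here is a minimal pure-Python sketch of the two without-replacement strategies. It is not Spark's implementation: the function names (bernoulli_sample, gap_sample, inclusion_rates) are hypothetical, and the gap sampler assumes the usual geometric-gap formulation, in which the number of rows skipped between picks is drawn from a geometric distribution. The point is only to show that, under these assumptions, both approaches give every row the same chance of being included.

```python
import math
import random
from collections import Counter


def bernoulli_sample(items, fraction, rng):
    """Keep each item independently when rng.random() <= fraction."""
    return [x for x in items if rng.random() <= fraction]


def gap_sample(items, fraction, rng):
    """Jump over a geometrically distributed number of items between picks.

    Assumed to match per-item Bernoulli sampling in distribution, while
    drawing one random number per kept item instead of one per input item.
    """
    assert 0.0 < fraction < 1.0
    log_q = math.log(1.0 - fraction)
    out = []
    i = -1
    while True:
        u = 1.0 - rng.random()  # value in (0, 1], so log(u) is defined
        i += 1 + int(math.log(u) / log_q)  # skip `gap` items, then pick one
        if i >= len(items):
            return out
        out.append(items[i])


def inclusion_rates(sampler, n_items, fraction, trials, seed):
    """Estimate, per item, how often the sampler includes it."""
    rng = random.Random(seed)
    items = list(range(n_items))
    counts = Counter()
    for _ in range(trials):
        counts.update(sampler(items, fraction, rng))
    return [counts[i] / trials for i in items]


if __name__ == "__main__":
    for sampler in (bernoulli_sample, gap_sample):
        rates = inclusion_rates(sampler, n_items=10, fraction=0.3,
                                trials=20000, seed=42)
        print(sampler.__name__, [round(r, 3) for r in rates])
```

If the sampling is uniform, every per-item rate printed should sit close to the requested fraction, regardless of the item's position. Note that both strategies return a sample whose size varies from run to run around fraction × number of rows, rather than an exact count.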
Justin Pihony answered Nov 24 '22 03:11