I want to randomly choose a set number of rows from a DataFrame, and I know the sample method does this, but I am concerned about whether the randomness is uniform. Is the sample method on Spark DataFrames a uniform sample or not?
Thanks
Spark sampling is a mechanism for getting random sample records from a dataset. Data analysts and data scientists often use sampling to obtain statistics on a subset of the data before applying the analysis to the full dataset.
PySpark sampling (pyspark.sql.DataFrame.sample()) is a mechanism to get random sample records from a dataset. This is helpful when you have a large dataset and want to analyze or test on a subset, for example 10% of the original file.
In summary, Spark sampling can be done on both RDDs and DataFrames. To sample, you specify how much data you want to retrieve via the fraction parameter. Pass withReplacement=True if you are okay with the sample repeating records.
There are a few code paths here:
withReplacement = false && fraction > .4: it uses a souped-up random number generator and a simple per-row check (rng.nextDouble() <= fraction) and lets that do the work. This should be pretty uniform.
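In pure Python, that first path amounts to an independent Bernoulli trial per row. A minimal sketch of the idea (not Spark's actual code; the function name is mine):

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`,
    mirroring a per-row `rng.nextDouble() <= fraction` check."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() <= fraction]

sample = bernoulli_sample(range(100_000), 0.5, seed=1)
print(len(sample))  # close to 50_000, varying run to run without a fixed seed
```

Because every row faces the same independent check, each row has exactly the same probability of being selected, which is why this path is uniform.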
withReplacement = false && fraction <= .4: it uses a more complex algorithm (GapSamplingIterator) that skips ahead over runs of rejected rows instead of testing every row. At a glance, it looks like it should be uniform as well.
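The gap-sampling trick can be sketched in plain Python: instead of one random draw per row, draw a geometrically distributed gap that says how many rows to skip before the next accepted one. This is only an illustration of the technique, not Spark's implementation, and the function name is mine:

```python
import math
import random

def gap_sample(rows, fraction, seed=None):
    """Yield each row with probability `fraction`, using one random
    draw per *accepted* row rather than one per row."""
    rng = random.Random(seed)
    lnq = math.log1p(-fraction)  # log(1 - fraction), negative
    i = -1
    while True:
        # Gap ~ Geometric(fraction): number of rows skipped before the
        # next acceptance, via inverse-transform sampling. 1 - random()
        # lies in (0, 1], so the log is always defined.
        gap = int(math.log(1.0 - rng.random()) / lnq)
        i += gap + 1
        if i >= len(rows):
            return
        yield rows[i]

picked = list(gap_sample(list(range(100_000)), 0.1, seed=7))
print(len(picked))  # close to 10_000
```

The expected distance between accepted rows is 1/fraction, so each row is still selected with probability fraction; the gain is purely fewer random draws when the fraction is small.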
withReplacement = true: it does close to the same thing, except that, by the looks of it, a row can be drawn more than once. Each row is still selected with the same probability, but the result can contain duplicates, so it is not a uniform subset in the same sense as the first two paths.
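Sampling with replacement is commonly done by drawing, for each row, a Poisson-distributed number of copies. The sketch below illustrates that idea in plain Python (using Knuth's method for the Poisson draw); it is my illustration of the general technique, not Spark's code:

```python
import math
import random

def poisson_sample(rows, fraction, seed=None):
    """Sampling *with* replacement: each row appears k times, where
    k ~ Poisson(fraction), so duplicates are possible."""
    rng = random.Random(seed)
    threshold = math.exp(-fraction)
    out = []
    for row in rows:
        # Knuth's multiplication method: multiply uniforms until the
        # product drops below e^(-fraction); the count gives k.
        k, p = 0, 1.0
        while p > threshold:
            k += 1
            p *= rng.random()
        out.extend([row] * (k - 1))
    return out

dup_sample = poisson_sample(range(100_000), 0.5, seed=3)
print(len(dup_sample))  # close to 50_000, and some rows appear twice or more
```

The expected number of copies of each row is fraction, so the expected output size matches the without-replacement case, but the presence of duplicates is exactly why this path feels "less uniform" than the first two.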