I want to randomly choose a set number of rows from a DataFrame, and I know the sample method does this, but I am concerned about whether the randomness is uniform. Is the sample method on Spark DataFrames a uniform sample or not?
Thanks
Spark sampling is a mechanism for getting random sample records from a dataset. Data analysts and data scientists often use sampling to obtain statistics on a subset of the data before applying the analysis to the full dataset.
PySpark sampling (pyspark.sql.DataFrame.sample()) is a mechanism to get random sample records from a dataset. This is helpful when you have a large dataset and want to analyze or test on a subset, for example 10% of the original file.
In summary, Spark sampling can be done on both RDDs and DataFrames. To sample, you specify how much data you want to retrieve via the fraction parameter. Pass withReplacement=True if you are okay with the sample repeating records.
There are a few code paths here:
withReplacement = false && fraction > .4: it uses a souped-up random number generator and a simple per-row check (rng.nextDouble() <= fraction) and lets that do the work. This should be pretty uniform.
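In pure Python, that first path amounts to an independent Bernoulli trial per row. A minimal sketch of the idea (not Spark's actual code; the function name is mine):

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`,
    mirroring a per-row `rng.nextDouble() <= fraction` check."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() <= fraction]

sample = bernoulli_sample(range(100_000), 0.5, seed=1)
print(len(sample))  # close to 50_000, varying run to run without a fixed seed
```

Because every row faces the same independent check, each row has exactly the same probability of being selected, which is why this path is uniform.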
withReplacement = false && fraction <= .4: it uses a more complex algorithm (GapSamplingIterator) that skips ahead over runs of rejected rows instead of testing every row. At a glance, it looks like it should be uniform as well.
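The gap-sampling trick can be sketched in plain Python: instead of one random draw per row, draw a geometrically distributed gap that says how many rows to skip before the next accepted one. This is only an illustration of the technique, not Spark's implementation, and the function name is mine:

```python
import math
import random

def gap_sample(rows, fraction, seed=None):
    """Yield each row with probability `fraction`, using one random
    draw per *accepted* row rather than one per row."""
    rng = random.Random(seed)
    lnq = math.log1p(-fraction)  # log(1 - fraction), negative
    i = -1
    while True:
        # Gap ~ Geometric(fraction): number of rows skipped before the
        # next acceptance, via inverse-transform sampling. 1 - random()
        # lies in (0, 1], so the log is always defined.
        gap = int(math.log(1.0 - rng.random()) / lnq)
        i += gap + 1
        if i >= len(rows):
            return
        yield rows[i]

picked = list(gap_sample(list(range(100_000)), 0.1, seed=7))
print(len(picked))  # close to 10_000
```

The expected distance between accepted rows is 1/fraction, so each row is still selected with probability fraction; the gain is purely fewer random draws when the fraction is small.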
withReplacement = true: it does close to the same thing, except that, by the looks of it, a row can be drawn more than once. Each row is still selected with the same probability, but the result can contain duplicates, so it is not a uniform subset in the same sense as the first two paths.
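Sampling with replacement is commonly done by drawing, for each row, a Poisson-distributed number of copies. The sketch below illustrates that idea in plain Python (using Knuth's method for the Poisson draw); it is my illustration of the general technique, not Spark's code:

```python
import math
import random

def poisson_sample(rows, fraction, seed=None):
    """Sampling *with* replacement: each row appears k times, where
    k ~ Poisson(fraction), so duplicates are possible."""
    rng = random.Random(seed)
    threshold = math.exp(-fraction)
    out = []
    for row in rows:
        # Knuth's multiplication method: multiply uniforms until the
        # product drops below e^(-fraction); the count gives k.
        k, p = 0, 1.0
        while p > threshold:
            k += 1
            p *= rng.random()
        out.extend([row] * (k - 1))
    return out

dup_sample = poisson_sample(range(100_000), 0.5, seed=3)
print(len(dup_sample))  # close to 50_000, and some rows appear twice or more
```

The expected number of copies of each row is fraction, so the expected output size matches the without-replacement case, but the presence of duplicates is exactly why this path feels "less uniform" than the first two.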