Reading the Spark documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample

There is a boolean parameter withReplacement that is not explained in much detail:

sample(withReplacement, fraction, seed=None)

What is it, and how do we use it?
Note that the fraction parameter does not guarantee an exact count. For example, sampling a DataFrame of 100 records with fraction=0.11 may return 13 records rather than exactly 11: sample() decides row membership independently per row, so the result only approximates the specified fraction.
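A pure-Python sketch of that per-row Bernoulli behaviour (this is an illustrative analogue using the standard library, not Spark's actual implementation; the function name bernoulli_sample is made up for this example):

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    # Keep each row independently with probability `fraction`,
    # so the returned count only approximates fraction * len(rows).
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100))               # stand-in for a 100-record DataFrame
sample = bernoulli_sample(rows, 0.11, seed=42)
print(len(sample))                    # close to 11, but not guaranteed to be exactly 11
```

This is why asking for 11% of 100 rows can come back with 13 rows: each row flips its own biased coin.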
Relatedly, sampleBy() returns a stratified sample without replacement, based on the fraction given for each stratum. Parameters: col — the column that defines the strata; fractions — the sampling fraction for each stratum. If a stratum is not specified, its fraction is treated as zero.
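The stratified behaviour can be sketched in pure Python (again an analogue, not Spark's implementation; sample_by and the 'group' column are hypothetical names for this example):

```python
import random

def sample_by(rows, key, fractions, seed=None):
    # Keep each row with the probability assigned to its stratum;
    # strata missing from `fractions` default to 0.0, so those rows are dropped.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key(row), 0.0)]

# Rows tagged with a 'group' value acting as the stratum column.
rows = [{"group": g, "id": i} for i, g in enumerate("aabbbbcccc")]
picked = sample_by(rows, key=lambda r: r["group"],
                   fractions={"a": 1.0, "b": 0.5})  # "c" defaults to 0.0
print(picked)
```

With these fractions, every "a" row is kept, roughly half the "b" rows are kept, and no "c" row can appear.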
The withReplacement parameter controls the uniqueness of the sampling result. If we treat a Dataset as a bucket of balls, withReplacement=true means taking a random ball out of the bucket and placing it back before the next draw, so the same ball can be picked again.

Assuming all elements in the Dataset are unique:

- With withReplacement=true, the same element can appear more than once in the sample.
- With withReplacement=false, each element of the dataset is sampled at most once.
import spark.implicits._

val df = Seq(1, 2, 3, 5, 6, 7, 8, 9, 10).toDF("ids")
df.show()
df.sample(true, 0.5, 5).show()
df.sample(false, 0.5, 5).show()
Result:
+---+
|ids|
+---+
| 1|
| 2|
| 3|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 6|
| 7|
| 7|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 1|
| 3|
| 7|
| 8|
| 9|
+---+