 

What does withReplacement do, if specified for sample against a Spark Dataframe

Tags:

apache-spark

Reading the Spark documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample

There is a boolean parameter withReplacement without much explanation.

sample(withReplacement, fraction, seed=None)

What is it and how do we use it?

Asked Dec 09 '18 by Yuchen

People also ask

How do I take samples from Spark DataFrame?

Using fraction to get a random sample from a Spark DataFrame does not guarantee an exact count. For example, if a DataFrame has 100 records and an 11% sample is requested, sample() may return 13 records rather than exactly 11: the fraction is applied as a per-row sampling probability, so the function does not return the exact fraction specified.
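
A minimal sketch of this behavior in Scala, the language of the answer below (assuming a SparkSession in scope as spark; the row count, fraction, and seed are illustrative):

    import spark.implicits._

    // 100 rows; request an 11% sample without replacement.
    val data = (1 to 100).toDF("id")
    val sampled = data.sample(withReplacement = false, fraction = 0.11, seed = 42)

    // Prints a count near 11, but not necessarily exactly 11.
    println(sampled.count())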

How do you write if else condition in PySpark?

PySpark When Otherwise – when() is a SQL function that returns a Column type, and otherwise() is a method on Column; if otherwise() is not used, rows matching no condition get a None/NULL value. PySpark SQL Case When is the equivalent SQL expression: CASE WHEN cond1 THEN result WHEN cond2 THEN result ... ELSE result END.
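
A short sketch of when()/otherwise() in Scala, matching the answer's language (the column name, values, and threshold are made up for illustration; assumes a SparkSession in scope as spark):

    import org.apache.spark.sql.functions.{col, when}
    import spark.implicits._

    val nums = Seq(1, 6, 9).toDF("ids")

    // Rows matching no when() condition would get NULL
    // if .otherwise were omitted.
    nums.withColumn("size", when(col("ids") > 5, "high").otherwise("low"))
      .show()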

What is stratified sampling in Spark?

The sampleBy() method (on DataFrame.stat) returns a stratified sample without replacement based on the fraction given for each stratum. Parameters: col — the column that defines the strata; fractions — the sampling fraction for each stratum. If a stratum is not specified, its fraction is treated as zero.
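
A hedged sketch of stratified sampling via df.stat.sampleBy in Scala (the key column, fractions, and seed are illustrative; assumes a SparkSession in scope as spark):

    import spark.implicits._

    val people = Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5))
      .toDF("key", "value")

    // Keep ~50% of stratum "a" and all of "b"; stratum "c" is not
    // specified, so its fraction is treated as zero.
    people.stat
      .sampleBy("key", Map("a" -> 0.5, "b" -> 1.0), seed = 7)
      .show()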


1 Answer

The parameter withReplacement controls the uniqueness of the sampled result. If we treat a Dataset as a bucket of balls, withReplacement=true means taking a random ball out of the bucket and placing it back in, so the same ball can be picked up again.

Assuming a Dataset whose elements are all unique:

  • withReplacement=true: the same element can appear more than once in the sample result.

  • withReplacement=false: each element of the Dataset is sampled at most once.

    import spark.implicits._

    val df = Seq(1, 2, 3, 5, 6, 7, 8, 9, 10).toDF("ids")

    df.show()

    // With replacement: the same id may be sampled more than once.
    df.sample(true, 0.5, 5).show()

    // Without replacement: each id appears at most once.
    df.sample(false, 0.5, 5).show()
    

    Result

    +---+
    |ids|
    +---+
    |  1|
    |  2|
    |  3|
    |  5|
    |  6|
    |  7|
    |  8|
    |  9|
    | 10|
    +---+
    
    +---+
    |ids|
    +---+
    |  6|
    |  7|
    |  7|
    |  9|
    | 10|
    +---+
    
    +---+
    |ids|
    +---+
    |  1|
    |  3|
    |  7|
    |  8|
    |  9|
    +---+
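
Note that the with-replacement output contains the id 7 twice, while the without-replacement output has no duplicates. Also, a fraction of 0.5 over 9 rows is not guaranteed to return exactly half the rows; it is a per-row sampling probability.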
    
Answered Sep 20 '22 by ryandam