Dataframe sample in Apache spark | Scala

Tags: dataframe, apache-spark, sample

I'm trying to take samples from two DataFrames while keeping the ratio of their counts, e.g.

df1.count() = 10
df2.count() = 1000
noOfSamples = 10

I want to sample the data so that I get 10 samples of size 101 each (1 row from df1 and 100 from df2).

Now, while doing so,

var newSample = df1.sample(true, df1.count() / noOfSamples)
println(newSample.count())

What does the fraction here imply? Can it be greater than 1? I looked at a couple of references but wasn't able to fully understand them.

Also, is there any way to specify the number of rows to be sampled?

asked May 24 '16 14:05

hbabbar

1 Answer

The fraction parameter represents the approximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, roughly 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:

val newSample = df1.sample(true, 1D*noOfSamples/df1.count)

However, you may notice that newSample.count returns a different number each time you run it. That's because the fraction is used as a threshold for a randomly generated value, so the resulting dataset size can vary. A workaround can be:

val newSample = df1.sample(true, 2D*noOfSamples/df1.count).limit((df1.count/noOfSamples).toInt)
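To make the "threshold for a randomly generated value" idea concrete, here is a minimal plain-Scala sketch that does not use Spark at all. It models the simpler without-replacement case, where each row is kept independently when a random draw falls below the fraction. The names BernoulliSketch and bernoulliSample are made up for this illustration and are not Spark APIs; Spark's own sampler may differ in detail.

```scala
import scala.util.Random

object BernoulliSketch {
  // Hypothetical helper: keep each element independently when a uniform
  // random draw in [0, 1) falls below `fraction`.
  def bernoulliSample[A](rows: Seq[A], fraction: Double, rng: Random): Seq[A] =
    rows.filter(_ => rng.nextDouble() < fraction)

  def main(args: Array[String]): Unit = {
    val rows = 1 to 1000
    val rng = new Random(42L)

    // Five samples with the same fraction: the sizes cluster around
    // 10% of the input but are generally not identical.
    val sizes = Seq.fill(5)(bernoulliSample(rows, 0.1, rng).size)
    println(sizes.mkString(", "))

    // Oversample with a larger fraction, then cap the result,
    // mirroring the sample(...).limit(...) workaround.
    val capped = bernoulliSample(rows, 0.2, rng).take(100)
    println(capped.size) // never more than 100
  }
}
```

Because every row is an independent coin flip, the sample size is itself random, which is why the cap after oversampling is needed to bound the result.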

Some scalability observations

Note that calling df1.count can be expensive because it evaluates the whole DataFrame, which gives up one of the benefits of sampling in the first place.

Therefore, depending on the context of your application, you may want to use an already known total row count, or an approximation of it:

val newSample = df1.sample(true, 1D*noOfSamples/knownNoOfSamples)

Or, assuming your DataFrame is huge, I would still use a fraction and then use limit to force the number of samples:

val guessedFraction = 0.1
val newSample = df1.sample(true, guessedFraction).limit(noOfSamples)

As for your questions:

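Can it be greater than 1?

Not when sampling without replacement: there the fraction is the probability of keeping each row, so it must lie between 0 and 1, and setting it to 1 returns 100% of the rows. With replacement (the first argument set to true, as in the calls above), Spark's RDD.sample documentation describes the fraction as the expected number of times each row is chosen and only requires it to be non-negative, so a value above 1 repeats rows more often rather than returning more than 100% of them.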

Also, is there any way to specify the number of rows to be sampled?

You can specify a fraction that yields more rows than you need and then use limit, as in the second example above. There may be another way, but this is the approach I use.
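As a rough sketch of that oversample-then-limit approach, the helper below wraps it in one function. It assumes Spark 2.x or later on the classpath and a local SparkSession; sampleAtMost, approxTotal and the factor of 2 are hypothetical choices for this illustration, not part of Spark. It samples without replacement so the fraction can simply be capped at 1.0.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleAtMost {
  // Hypothetical helper (not a Spark API): oversample, then cap with limit.
  // `approxTotal` is a known or estimated row count supplied by the caller,
  // which avoids calling df.count just to compute the fraction.
  def sampleAtMost(df: DataFrame, n: Int, approxTotal: Long, seed: Long): DataFrame = {
    // Ask for roughly twice the rows needed; cap at 1.0 because the
    // sampling below is without replacement.
    val fraction = math.min(1.0, 2.0 * n / approxTotal)
    // Arguments are (withReplacement, fraction, seed).
    df.sample(false, fraction, seed).limit(n)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-at-most")
      .getOrCreate()

    val df = spark.range(1000).toDF("id")
    val picked = sampleAtMost(df, n = 10, approxTotal = 1000L, seed = 42L)
    println(picked.count()) // at most 10; can be fewer if the sample came up short

    spark.stop()
  }
}
```

The result is bounded above by n but not guaranteed to reach it: if the random sample happens to contain fewer than n rows, limit has nothing extra to return. A larger oversampling factor makes that less likely at the cost of scanning more sampled rows.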

answered Oct 07 '22 15:10

Daniel de Paula
