I've seen many questions posted on stratified sampling, but none of them answered my question, so I'm asking it as a new post, hoping to get an update.
I have noticed a difference between the sample size I request and the results returned by the Spark API sampleBy(). The difference is not very significant for small DataFrames, but it is more noticeable for larger ones (>1000 rows).
sample code:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val inputRDD: RDD[(Any, Row)] = df.rdd.keyBy(x => x.get(0))
val keyCount = inputRDD.countByKey()
// fraction per key: (key count * sampleSize) / (totalCount * 100)
val sampleFractions = keyCount.map { case (key, count) =>
  (key, count.toDouble * sampleSize / (totalCount * 100))
}.toMap
val sampleDF = df.stat.sampleBy(cols(0), fractions = sampleFractions, seed = 11L)
total dataframe count: 200
key counts: A: 16, B: 91, C: 54, D: 39
fractions: Map(A -> 0.08, B -> 0.455, C -> 0.27, D -> 0.195)
I get only 69 rows back from df.stat.sampleBy(), even though the sample size I expected was 100 (expressed, of course, as per-key fractions for the Spark API).
Thanks
sampleBy doesn't guarantee you'll get the exact fraction of rows. Each record is included in the sample independently, with probability equal to the fraction specified for its key, so the returned count varies from run to run; there is nothing unusual about that.
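You can see this by running the same sample with a few different seeds and comparing the counts. A minimal sketch, assuming the stratification column is literally named "key" (it is cols(0) in your code) and reusing sampleFractions from the question:

Seq(1L, 11L, 42L).foreach { seed =>
  val n = df.stat.sampleBy("key", sampleFractions, seed).count()
  println(s"seed=$seed -> $n rows")  // the count differs slightly per seed
}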
The expected count is the sum over keys of count * fraction: A: 16 * 0.08 = 1.28, B: 91 * 0.455 = 41.405, C: 54 * 0.27 = 14.58, D: 39 * 0.195 = 7.605, which adds up to about 65 rows (64.87). Getting 69 on a given run is within normal variation.
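A sketch of that arithmetic in Scala (keyCount, sampleFractions, sampleSize and totalCount as in the question; expectedRows is a name introduced here for illustration):

// expected sample size = sum over keys of (key count * key fraction)
val expectedRows = keyCount.map { case (k, n) => n * sampleFractions(k) }.sum
// ≈ 64.87 for the counts and fractions above

If the goal is an expected sample of sampleSize rows that preserves the key proportions, every key would need the same fraction, sampleSize / totalCount (0.5 here), rather than a fraction proportional to its own count:

val uniformFractions = keyCount.keys.map(k => k -> sampleSize.toDouble / totalCount).toMap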