 

Scala Spark: Difference in the results returned by df.stat.sampleBy()

I saw many questions posted on stratified sampling, but none of them answered my question, so I am asking it as a new post, hoping to get an update.

I have noticed that there is a difference in the results returned by the Spark API sampleBy(). It is not very significant for a small DataFrame, but it is more noticeable for a large one (>1000 rows).

sample code:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// key each row by its first column and count rows per key
val inputRDD: RDD[(Any, Row)] = df.rdd.keyBy(x => x.get(0))
val keyCount = inputRDD.countByKey()
// per-key fraction: (rows for key * sampleSize) / (totalCount * 100)
val sampleFractions = keyCount.map { case (k, v) =>
  (k, (v.toDouble * sampleSize) / (totalCount * 100))
}.toMap
// stratified sample on the first column using those fractions
val sampleDF = df.stat.sampleBy(cols(0), fractions = sampleFractions, seed = 11L)

Total DataFrame count: 200. Key counts: A: 16, B: 91, C: 54, D: 39

fractions : Map(A -> 0.08, B -> 0.455, C -> 0.27, D -> 0.195)
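For reference, here is a minimal Scala sketch (not part of the original post) that reproduces these fractions from the formula above, assuming sampleSize = 100 and totalCount = 200:

// assumed values, inferred from the post: a 100-row sample out of 200 rows
val sampleSize = 100
val totalCount = 200L
val keyCount = Map("A" -> 16L, "B" -> 91L, "C" -> 54L, "D" -> 39L)

val sampleFractions = keyCount.map { case (k, v) =>
  (k, (v.toDouble * sampleSize) / (totalCount * 100))
}
println(sampleFractions)  // Map(A -> 0.08, B -> 0.455, C -> 0.27, D -> 0.195)

Note that with these inputs each fraction is simply keyCount / totalCount.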

I get only 69 rows as output from df.stat.sampleBy(), even though the sample size I expected is 100; of course, this is specified to the Spark API as fractions.

Thanks

Garipaso asked Feb 05 '23


2 Answers

sampleBy doesn't guarantee you'll get the exact fraction of rows. It takes a sample in which each record is included with probability equal to the fraction for its key, so the returned row count will vary from run to run, and there is nothing unusual about that.
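A minimal sketch of this behavior (assuming df and cols(0) are defined as in the question, with the fractions from the post): running sampleBy with different seeds returns different row counts.

// count the rows sampleBy returns for several seeds; only the seed changes
val fractions = Map("A" -> 0.08, "B" -> 0.455, "C" -> 0.27, "D" -> 0.195)

Seq(1L, 7L, 11L, 42L).foreach { seed =>
  val n = df.stat.sampleBy(cols(0), fractions, seed).count()
  println(s"seed=$seed -> $n rows")  // counts differ per seed, clustering around the expected value
}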

user7761554 answered May 12 '23


The result combines A -> 16 * 0.08, B -> 91 * 0.455, C -> 54 * 0.27, D -> 39 * 0.195 = (1.28 + 41.405 + 14.58 + 7.605) rows, which works out to about 65 rows in expectation, so getting 69 rows in a particular run is within normal sampling variation.
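A quick sketch of the same expected-value computation, using the counts and fractions from the question:

val counts    = Map("A" -> 16, "B" -> 91, "C" -> 54, "D" -> 39)
val fractions = Map("A" -> 0.08, "B" -> 0.455, "C" -> 0.27, "D" -> 0.195)

// expected rows per key = rows for key * per-key fraction
val expected = counts.map { case (k, n) => n * fractions(k) }.sum
println(expected)  // 64.87 -> roughly 65 rows on average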

FaigB answered May 12 '23