I have an RDD of a Foo class: class Foo(name: String, createDate: Date).
I want another RDD containing the oldest 10 percent of the Foo objects.
My first idea was to sort by createDate and limit the result to 0.1 * count, but there is no limit function on RDDs.
Do you have an idea?
Assuming Foo is a case class like this:
import java.sql.Date
case class Foo(name: String, createDate: java.sql.Date)
Using plain RDDs:
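import org.apache.spark.rdd.RDD
import scala.math.Ordering
// toDF, $ and as[Foo] need the SQL implicits, which spark-shell already has in scope
val rdd: RDD[Foo] = sc
  .parallelize(Seq(
    ("a", "2015-01-03"), ("b", "2014-11-04"), ("a", "2016-08-10"),
    ("a", "2013-11-11"), ("a", "2015-06-19"), ("a", "2009-11-23")))
  .toDF("name", "createDate")
  .withColumn("createDate", $"createDate".cast("date"))
  .as[Foo].rdd
// Cache because the RDD is used twice: once for count, once for the selection
rdd.cache()
// Number of elements that make up the oldest 10 percent, rounded up
val n = scala.math.ceil(0.1 * rdd.count).toInt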
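The size calculation can be checked on its own, without Spark. The sketch below applies the same formula to a few counts; sampleSize is a hypothetical helper name used only for this illustration.

```scala
// Hypothetical helper, not part of the answer above: same formula as n.
def sampleSize(count: Long, fraction: Double): Int =
  math.ceil(fraction * count).toInt

// Try a few counts; 6 is the size of the example RDD.
Seq(6L, 10L, 25L, 100L).foreach { c =>
  println(s"count=$c -> n=${sampleSize(c, 0.1)}")
}
```

Because of the ceil, any non-empty RDD yields at least one element. If you would rather never exceed 10 percent, math.floor is the alternative, at the cost of getting zero elements for fewer than 10 records.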
If the data fits into driver memory and the fraction you want is relatively small:
rdd.takeOrdered(n)(Ordering.by[Foo, Long](_.createDate.getTime))
// Array[Foo] = Array(Foo(a,2009-11-23))
If the data fits into driver memory but the fraction you want is relatively large:
rdd.sortBy(_.createDate.getTime).take(n)
Otherwise (the result does not fit into driver memory), keep it as an RDD:
rdd
  .sortBy(_.createDate.getTime)
  .zipWithIndex
  .filter{case (_, idx) => idx < n}
  .keys
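To see what the sort, index and filter steps compute without starting a Spark job, the same pipeline can be mimicked on a local Seq. This is only an illustration under stated assumptions: FooRow, rows, k and oldest are hypothetical names, and a local Seq has no .keys, so map(_._1) stands in for it.

```scala
import java.sql.Date

// Local stand-in for the RDD example; FooRow is a hypothetical name
// chosen to avoid clashing with the Foo defined above.
case class FooRow(name: String, createDate: Date)

val rows = Seq(
  ("a", "2015-01-03"), ("b", "2014-11-04"), ("a", "2016-08-10"),
  ("a", "2013-11-11"), ("a", "2015-06-19"), ("a", "2009-11-23")
).map { case (name, d) => FooRow(name, Date.valueOf(d)) }

// Same size formula as n in the answer
val k = math.ceil(0.1 * rows.size).toInt

val oldest = rows
  .sortBy(_.createDate.getTime)        // oldest first
  .zipWithIndex                        // pair each row with its rank
  .filter { case (_, idx) => idx < k } // keep the first k ranks
  .map(_._1)                           // drop the index again

println(oldest)
```

On an RDD the indices produced by zipWithIndex are Long values and the result stays distributed, which is the point of that variant: nothing is collected to the driver.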
Using a DataFrame (note: this is not optimal performance-wise, because of how limit behaves):
import org.apache.spark.sql.Row
val topN = rdd.toDF.orderBy($"createDate").limit(n)
topN.show
// +----+----------+
// |name|createDate|
// +----+----------+
// |   a|2009-11-23|
// +----+----------+
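// Optionally recreate RDD[Foo]; going through .rdd keeps the result an RDD
// on Spark versions where DataFrame.map returns a Dataset
topN.rdd.map { case Row(name: String, date: Date) => Foo(name, date) }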