Using Spark 2.3.1 with Scala, Reduce Arbitrary List of Date Ranges into distinct non-overlapping ranges of dates

Given a list of date ranges, some of which overlap:

val df = Seq(
  ("Mike","2018-09-01","2018-09-10"), // range 1
  ("Mike","2018-09-05","2018-09-05"), // range 1
  ("Mike","2018-09-12","2018-09-12"), // range 1
  ("Mike","2018-09-11","2018-09-11"), // range 1
  ("Mike","2018-09-25","2018-09-29"), // range 4
  ("Mike","2018-09-21","2018-09-23"), // range 4
  ("Mike","2018-09-24","2018-09-24"), // range 4
  ("Mike","2018-09-14","2018-09-16"), // range 2
  ("Mike","2018-09-15","2018-09-17"), // range 2
  ("Mike","2018-09-05","2018-09-05"), // range 1
  ("Mike","2018-09-19","2018-09-19"), // range 3
  ("Mike","2018-09-19","2018-09-19"), // range 3
  ("Mike","2018-08-19","2018-08-20"), // range 5
  ("Mike","2018-10-01","2018-10-20"), // range 6
  ("Mike","2018-10-10","2018-10-30")  // range 6
).toDF("name", "start", "end")

I'd like to reduce the data down to the minimum set of date ranges that completely encapsulate the above dates with no extra dates added:

+----+----------+----------+                                                    
|name|start     |end       |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+

EDIT: Added three new entries to the test data to account for new edge cases.

I cannot rely on the dates being in any particular order.

My best attempt at this so far:

  1. Explode each date range into its set of individual days
  2. Union the sets together into one big set of all the days
  3. Sort the set into a list so the days are in order
  4. Aggregate the individual days back into a list of lists of consecutive days.
  5. Take the first and last day of each list as the new ranges.

The code, such as it is:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import scala.collection.immutable.NumericRange
import java.time.LocalDate

case class MyRange(start:String, end:String)

val combineRanges = udf((ranges: Seq[Row]) => {
  ranges.map(r => LocalDate.parse(r(0).toString).toEpochDay to LocalDate.parse(r(1).toString).toEpochDay)
    .map(_.toIndexedSeq).reduce(_ ++ _).distinct.toList.sorted
    .aggregate(List.empty[Vector[Long]])((ranges: List[Vector[Long]], d: Long) => {
      ranges.lastOption.find(_.last + 1 == d) match {
        case Some(r: Vector[Long]) => ranges.dropRight(1) :+ (r :+ d)
        case None => ranges :+ Vector(d)
      }
    }, _ ++ _)
    .map(v => MyRange(LocalDate.ofEpochDay(v.head).toString, LocalDate.ofEpochDay(v.last).toString))
})

df.groupBy("name")
  .agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
  .withColumn("ranges", explode($"ranges"))
  .select($"name", $"ranges.start", $"ranges.end")
  .show(false)

It seems to work, but it is very ugly and probably wasteful of time and memory.

I was kind of hoping to use the Scala Range class to only notionally explode the date ranges into their individual days, but I've got a feeling the sort operation forces Scala's hand and makes it actually create a list of all the dates in memory.
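
For comparison, here is a rough, untested sketch of the classic interval merge applied directly to the (start, end) pairs, with no exploding at all: sort by start date, then fold over the sorted pairs, extending the current range whenever the next one starts at most one day after its end. The mergeRanges name is just for illustration; it could be wrapped in a UDF like the one above.

import java.time.LocalDate

// Untested sketch: merge (start, end) date-string pairs directly, without
// enumerating individual days. Assumes ISO-8601 date strings as in the data above.
def mergeRanges(ranges: Seq[(String, String)]): Seq[(String, String)] = {
  val sorted = ranges
    .map { case (s, e) => (LocalDate.parse(s).toEpochDay, LocalDate.parse(e).toEpochDay) }
    .sortBy(_._1)

  sorted
    .foldLeft(List.empty[(Long, Long)]) {
      // the accumulator is kept in reverse order, most recent range first
      case ((curStart, curEnd) :: rest, (s, e)) if s <= curEnd + 1 =>
        (curStart, math.max(curEnd, e)) :: rest // overlaps or touches: extend the current range
      case (acc, next) =>
        next :: acc // gap of more than one day: start a new range
    }
    .reverse
    .map { case (s, e) => (LocalDate.ofEpochDay(s).toString, LocalDate.ofEpochDay(e).toString) }
}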

Does anyone have a better way of doing this?

asked Oct 28 '18 by Jeremy

3 Answers

I think the easiest (and most readable) way is to explode the ranges into individual days and then aggregate back into intervals. As the number of days cannot grow too large, I think exploding is not a bottleneck here. I show a "pure-Scala" solution which is then used inside a UDF that gets all the intervals from a collect_list aggregation:

import java.time.LocalDate
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def enumerateDays(start: LocalDate, end: LocalDate) = {
  Iterator.iterate(start)(d => d.plusDays(1L))
    .takeWhile(d => !d.isAfter(end)) 
    .toList
}

implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)

val combineRanges = udf((data: Seq[Row]) => {
  val dateEnumerated =
    data
      .toSet[Row] // use Set to save memory if many spans overlap
      // "explode" date spans into individual days
      .flatMap { case Row(start: String, end: String) => enumerateDays(LocalDate.parse(start), LocalDate.parse(end)) }
      .toVector
      .sorted

  // combine subsequent dates into Vectors
  dateEnumerated.tail
    .foldLeft(Vector(Vector(dateEnumerated.head))) { (agg, curr) =>
      if (ChronoUnit.DAYS.between(agg.last.last, curr) == 1L) {
        agg.init :+ (agg.last :+ curr)
      } else {
        agg :+ Vector(curr)
      }
    }
    // now get min/max of dates per span
    .map(r => (r.min.toString, r.max.toString))
})

df.groupBy("name")
  .agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
  .withColumn("ranges", explode($"ranges"))
  .select($"name", $"ranges._1".as("start"), $"ranges._2".as("end"))
  .show(false)

gives

+----+----------+----------+
|name|start     |end       |
+----+----------+----------+
|Mike|2018-08-19|2018-08-20|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+

I think it's also doable with more DataFrame-API logic: I would still explode using a UDF, but then use window functions and groupBy to build the new blocks based on the number of days between two dates. But I think the above solution is also fine.
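
A rough, untested sketch of that variant could look like the following, reusing the enumerateDays helper from above wrapped in a UDF (Spark 2.3 has no built-in function to generate a date sequence); the gap and block column names are just illustrative:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// explode each span into its individual days via the enumerateDays helper above
val enumerateDaysUdf = udf((start: String, end: String) =>
  enumerateDays(LocalDate.parse(start), LocalDate.parse(end)).map(_.toString))

val days = df
  .select($"name", explode(enumerateDaysUdf($"start", $"end")) as "day")
  .distinct()

val w = Window.partitionBy($"name").orderBy($"day")

days
  // flag a new block whenever more than one day separates this day from the previous one
  .withColumn("gap", when(datediff($"day", lag($"day", 1).over(w)) > 1, 1).otherwise(0))
  // a running sum of the flags numbers the contiguous blocks of days
  .withColumn("block", sum($"gap").over(w))
  .groupBy($"name", $"block")
  .agg(min($"day") as "start", max($"day") as "end")
  .select($"name", $"start", $"end")
  .show(false)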

answered Nov 15 '22 by Raphael Roth

Here is an alternative with DataFrames and Spark SQL, both non-procedural and procedural by definition. You need to read it carefully and persist.

// Aspects such as caching and re-partitioning for performance not considered. On the other hand it all happens under the bonnet with DFs - so they say.
// Functional only.
import org.apache.spark.sql.functions._
import spark.implicits._
import java.time._
import org.apache.spark.sql.functions.{lead, lag}
import org.apache.spark.sql.expressions.Window

def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
val toEpochDayUdf = udf(toEpochDay(_: String))

val df = Seq(
("Betty","2018-09-05","2018-09-05"),  ("Betty","2018-09-05","2018-09-05"), 
("Betty","2018-09-05","2018-09-08"),  ("Betty","2018-09-07","2018-09-10"),  
("Betty","2018-09-07","2018-09-08"),  ("Betty","2018-09-06","2018-09-07"),  
("Betty","2018-09-10","2018-09-15"),  ("Betty","2017-09-10","2017-09-15"),
("XXX","2017-09-04","2017-09-10"),    ("XXX","2017-09-10","2017-09-15"),
("YYY","2017-09-04","2017-09-10"),    ("YYY","2017-09-11","2017-09-15"),
("Bob","2018-09-01","2018-09-02"),    ("Bob","2018-09-04","2018-09-05"),  
("Bob","2018-09-06","2018-09-07"),    ("Bob","2019-09-04","2019-09-05"),  
("Bob","2019-09-06","2019-09-07"),    ("Bob","2018-09-08","2018-09-22")   
           ).toDF("name", "start", "end")

// Remove any duplicates - pointless to re-process these!
val df2 = df.withColumn("s", toEpochDayUdf($"start")).withColumn("e", toEpochDayUdf($"end")).distinct  
df2.show(false) // The original input
df2.createOrReplaceTempView("ranges")

// Find those records encompassed by a broader time frame and hence not required for processing.
val q = spark.sql("""  SELECT * 
                         FROM ranges r1
                        WHERE EXISTS (SELECT r2.name                        
                                        FROM ranges r2
                                       WHERE r2.name = r1.name 
                                         AND ((r1.s >= r2.s AND r1.e < r2.e) OR 
                                              (r1.e <= r2.e AND r1.s > r2.s))
                                     ) 
                  """)   
//q.show(false)

val df3 = df2.except(q) // Overlapping or on their own / single range records left.
//df3.show(false)
df3.createOrReplaceTempView("ranges2")

// Find those ranges that have a gap between them and the next adjacent records, before or after, i.e. records that stand on their own and are de facto the first part of the answer.
val q2 = spark.sql("""  SELECT * 
                         FROM ranges2 r1
                        WHERE NOT EXISTS (SELECT r2.name                        
                                            FROM ranges2 r2
                                           WHERE r2.name = r1.name 
                                             AND ((r2.e >= r1.s - 1 AND r2.s <= r1.s - 1) OR
                                                  (r2.s <= r1.e + 1 AND r2.e >= r1.e + 1))
                                         ) 
                   """)

// Store the first set of records that exist on their own with some form of gap - the first part of the overall result set.
val result1 = q2.select("name", "start", "end")
result1.show(false) 

// Get the records / ranges that have overlaps to process - the second remaining set of such records to process.
val df4 = df3.except(q2) 
//df4.show(false)

//Avoid Serialization errors with lag!
@transient val w = org.apache.spark.sql.expressions.Window.partitionBy("name").orderBy("e")
@transient val lag_y = lag("e", 1, -99999999).over(w)
//df.select(lag_y).map(f _).first
val df5 = df4.withColumn("new_col", lag_y)
//df5.show(false)

// Massage data to get results via easier queries, e.g. avoid issues with correlated sub-queries.
val myExpression = "s - new_col"
val df6 = df5.withColumn("result", when($"new_col" === 0, 0).otherwise(expr(myExpression)))
//df6.show(false)
df6.createOrReplaceTempView("ranges3")

val q3 = spark.sql("""  SELECT *, dense_rank() over (PARTITION BY name ORDER BY start ASC) as RANK
                          FROM ranges3
                          WHERE new_col = -99999999 OR result > 1
                   """)
q3.createOrReplaceTempView("rangesSTARTS")

val q4 = spark.sql("""  SELECT *
                          FROM ranges3
                         WHERE result <= 1 AND new_col <> -99999999 
                   """)
q4.createOrReplaceTempView("rangesFOLLOWERS")

val q5 = spark.sql("""  SELECT r1.*, r2.start as next_start
                          FROM rangesSTARTS r1 LEFT JOIN rangesSTARTS r2
                           ON r2.name = r1.name 
                          AND r2.RANK = r1.RANK + 1 
                   """)
//q5.show(false)

val q6 = q5.withColumn("end_period", when($"next_start".isNull, "2525-01-01").otherwise($"next_start"))
//q6.show(false)
q6.createOrReplaceTempView("rangesSTARTS2")

// Second and final set of results - the head and tail of such set of range records.
val result2 = spark.sql("""  SELECT r1.name, r1.start, MAX(r2.end) as end
                               FROM rangesFOLLOWERS r2, rangesSTARTS2 r1
                              WHERE r2.name = r1.name
                                AND r2.end >= r1.start 
                                AND r2.end <  r1.end_period
                           GROUP BY r1.name, r1.start """)   
result2.show(false)

val finalresult = result1.union(result2)
finalresult.show

returns:

+-----+----------+----------+
| name|     start|       end|
+-----+----------+----------+
|  Bob|2018-09-01|2018-09-02|
|Betty|2017-09-10|2017-09-15|
|  YYY|2017-09-04|2017-09-15|
|  Bob|2018-09-04|2018-09-22|
|  Bob|2019-09-04|2019-09-07|
|  XXX|2017-09-04|2017-09-15|
|Betty|2018-09-05|2018-09-15|
+-----+----------+----------+

An interesting contrast - which is better for performance and style? My last such effort for a while. Interested in your comments. You know the programming aspects better than I do, so this question provides some good comparison and some good education. The other solutions do explode, which is not how I saw it.

answered Nov 15 '22 by thebluephantom

I think my second approach is better, but still far from perfect. It at least avoids iterating through every day in every date range, although it now processes every range multiple times. I think I'm mostly going to be processing a few large ranges instead of a bunch of little ranges, so maybe that's ok.

Given:

val ranges = Seq(
  ("Mike","2018-09-01","2018-09-10"),
  ("Mike","2018-09-05","2018-09-05"),
  ("Mike","2018-09-12","2018-09-12"),
  ("Mike","2018-09-11","2018-09-11"),
  ("Mike","2018-09-25","2018-09-30"),
  ("Mike","2018-09-21","2018-09-23"),
  ("Mike","2018-09-24","2018-09-24"),
  ("Mike","2018-09-14","2018-09-16"),
  ("Mike","2018-09-15","2018-09-17"),
  ("Mike","2018-09-05","2018-09-05"),
  ("Mike","2018-09-19","2018-09-19"),
  ("Mike","2018-09-19","2018-09-19"),
  ("Mike","2018-08-19","2018-08-20"),
  ("Mike","2018-10-01","2018-10-20"),
  ("Mike","2018-10-10","2018-10-30")
)
val df = ranges.toDF("name", "start", "end")

I want:

+----+----------+----------+                                                    
|name|start     |end       |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-21|2018-09-30|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+

(They are not in order this time. I'm ok with that since it was never a requirement. It just happened to be an artifact of my previous approach)

// very specific helper functions to convert a date string to and from a range
implicit class MyString(s:String) {
  def toFilteredInt: Int = s.filter(_.isDigit).toInt
  def to(s2:String): Range = s.toFilteredInt to s2.toFilteredInt
  // this only works for YYYYMMDD strings. very dangerous.
  def toDateStr = s"${s.slice(0,4)}-${s.slice(4,6)}-${s.slice(6,8)}"
}

// helper functions to combine two ranges
implicit class MyRange(r:Range) {
  def expand(i: Int): Range = r.head - i * r.step to r.last + i * r.step
  def containsPart(r2:Range): Boolean = r.contains(r2.head) || r.contains(r2.last)
  def intersects(r2:Range): Boolean = r.containsPart(r2) || r2.containsPart(r)
  def combine(r2:Range): Option[Range] = {
    if (r.step == r2.step && r.intersects(r2 expand 1)) {
      if (r.step > 0) Some(Math.min(r.head, r2.head) to Math.max(r.last, r2.last))
      else Some(Math.max(r.head, r2.head) to Math.min(r.last, r2.last))
    }
    else None
  }
  def toDateStrTuple: (String,String) = (r.start.toString.toDateStr, r.end.toString.toDateStr)
}

// combines a range to one of the ranges in a sequence if possible;
// adds it to the sequence if it can't be combined.
def addToRanges(rs:Seq[Range], r:Range): Seq[Range] = {
  if (rs.isEmpty) Seq(r)
  else r.combine(rs.last) match {
    case Some(r:Range) => rs.init :+ r
    case None => addToRanges(rs.init, r) :+ rs.last
  }
}

// tries to combine every range in the sequence with every other range
// does not handle the case where combining two ranges together allows
// them to be combined with a previous range in the sequence.
// however, if we call this and nothing has been combined, we know
// we are done
def collapseOnce(rs:Seq[Range]):Seq[Range] = {
  if (rs.size <= 1) rs
  else addToRanges(collapseOnce(rs.init), rs.last)
}

// keep collapsing the sequence of ranges until they can't collapse
// any further
def collapseAll(rs:Seq[Range]):Seq[Range] = {
  val collapsed = collapseOnce(rs)
  if (rs.size == collapsed.size) rs
  else collapseAll(collapsed)
}

// now our udf is much simpler
val combineRanges = udf((rows: Seq[Row]) => {
  val ranges  = rows.map(r => r(0).toString to r(1).toString)
  collapseAll(ranges).map(_.toDateStrTuple)
})


df.groupBy("name").agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges"
  ).withColumn("ranges", explode($"ranges")
  ).select($"name", $"ranges._1" as "start", $"ranges._2" as "end").show(false)

Room for improvement:

  • I'm pretty sure I'd get better performance most of the time if I bailed out of collapseOnce as soon as I found a range to combine. The typical use case will be adding a single day range to the last range in the sequence.
  • collapseOnce and addToRanges are not yet tail recursive (one possible rewrite of addToRanges is sketched after this list).
  • Some of the date-to-string and string-to-date methods in my implicit classes should probably be stand-alone methods. They are super specific to my problem and don't deserve to be included in general String and Range classes.
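
For the second bullet, one possible (untested) shape for a tail-recursive addToRanges is to carry the ranges already inspected in an accumulator instead of re-appending them on the way back up the call stack. The addToRangesTR name is just for illustration, and it assumes the MyRange implicit class above is in scope.

import scala.annotation.tailrec

// Untested sketch: same behaviour as addToRanges above, but the suffix of ranges
// that could not be combined is carried in `acc` (in original order), so the
// recursive call is in tail position.
@tailrec
def addToRangesTR(rs: Seq[Range], r: Range, acc: Seq[Range] = Seq.empty): Seq[Range] = {
  if (rs.isEmpty) r +: acc
  else r.combine(rs.last) match {
    case Some(combined) => (rs.init :+ combined) ++ acc
    case None           => addToRangesTR(rs.init, r, rs.last +: acc)
  }
}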

answered Nov 15 '22 by Jeremy