Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In Scala, how can I do the equivalent of an SQL SUM and GROUP BY?

Tags:

For example, suppose I have

val list: List[(String, Double)]

with values

"04-03-1985", 1.5
"05-03-1985", 2.4
"05-03-1985", 1.3

How could I produce a new List

"04-03-1985", 1.5
"05-03-1985", 3.7
like image 511
deltanovember Avatar asked Aug 22 '11 02:08

deltanovember


People also ask

How does groupBy work in Scala?

Scala groupBy is used for grouping of elements based on some criteria defined as a predicate inside the function. This function internally converts the collection into map object and this map object work on key value pair. This is basically used for storing and retrieving of the elements.

Does groupBy preserve order Scala?

groupBy as defined on TraversableLike produces an immutable. Map , so you can't make this method produce something else. The order of the elements in each entry is already preserved, but not the order of the keys. The keys are the result of the function supplied, so they don't really have an order.

How do you get the sum of count in PySpark?

PySpark – sumDistinct() sumDistinct() in PySpark returns the distinct total (sum) value from a particular column in the DataFrame. It will return the sum by considering only unique values. It will not take duplicate values to form a sum.


2 Answers

Here's a one-liner. It's not particularly readable, unless one really internalizes the types of these higher order functions.

val s = Seq(("04-03-1985" -> 1.5),
            ("05-03-1985" -> 2.4),
            ("05-03-1985" -> 1.3))

s.groupBy(_._1).mapValues(_.map(_._2).sum)
// returns: Map(04-03-1985 -> 1.5, 05-03-1985 -> 3.7)

Another approach is to add the key-value pairs one-by-one using fold,

s.foldLeft(Map[String, Double]()) { case (m, (k, v)) =>
  m + (k -> (v + m.getOrElse(k, 0d)))
}

The equivalent for comprehension is most accessible, in my opinion,

var m = Map[String, Double]()
for ((k, v) <- s) {
  m += k -> (v + m.getOrElse(k, 0d))
}

Maybe something nicer can be done with Scalaz's monoid typeclass for Map.

Note that you can convert between Map[K, V] and Seq[(K, V)] using the toSeq and toMap methods.


Update. After pondering it some more, I think the natural abstraction would be a "multimap" conversion, of type,

def seqToMultimap[A, B](s: Seq[(A, B)]): Map[A, Seq[B]]

With the appropriate implicit extension in one's personal library, one could then write:

s.toMultimap.mapValues(_.sum)

This is the clearest of all, in my opinion!

like image 178
Kipton Barros Avatar answered Oct 08 '22 02:10

Kipton Barros


There is another possibility using Scalaz.

The key point is to notice that, if M is a Monoid, then Map[T, M] is also a Monoid. This means that if I have 2 maps, m1 and m2 I can add them so that, for each similar key, the elements will be added together.

For example, Map[String, List[String]] is a Monoid because List[String] is a Monoid. So given the appropriate Monoid instance in scope, I should be able to do:

  val m1 = Map("a" -> List(1), "b" -> List(3))
  val m2 = Map("a" -> List(2))

  // |+| "adds" two elements of a Monoid together in Scalaz
  m1 |+| m2 === Map("a" -> List(1, 2), "b" -> List(3))

For your question we can see that Map[String, Int] is a Monoid because there is a Monoid instance for the Int type. Let's import it:

  implicit val mapMonoid = MapMonoid[String, Int]

Then, I need a function reduceMonoid, which takes anything that's Traversable and "adds" its elements with a Monoid. I just write the reduceMonoid definition here, for the full implementation, please refer to my post on the Essence of the Iterator Pattern:

  // T is a "Traversable"
  def reduce[A, M : Monoid](reducer: A => M): T[A] => M

Those 2 definitions do not exist in the current Scalaz library but they are not difficult to add (based on the existing Monoid and Traverse typeclasses). And once we have them, the solution to your question is very straightforward:

  val s = Seq(("04-03-1985" -> 1.5),
              ("05-03-1985" -> 2.4),
              ("05-03-1985" -> 1.3))

   // we just put each pair in its own map and we let the Monoid instance
   // "add" the maps together
   s.reduceMonoid(Map(_)) === Map("04-03-1985" -> 1.5,
                                  "05-03-1985" -> 3.7)

If you feel that the code above is a bit obscure (but really concise, right?), I encourage you to check the github project for the EIP post and play with it. One example shows the solution to your question:

   "I can build a map String->Int" >> {
     val map1 = List("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 4, "b" -> 5)
     implicit val mapMonoid = MapMonoid[String, Int]

     map1.reduceMonoid(Map(_)) must_== Map("a" -> 3, "b" -> 8, "c" -> 4)
   }
like image 26
Eric Avatar answered Oct 08 '22 02:10

Eric