Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Avoiding accidental removal of duplicates when mapping a Set

I really like functional programming concepts, but I've been bitten on two separate occasions now by the same gotcha, when mapping across a collection which happens to be a Set (i.e. automatically removes duplicates). The issue is that after transforming the elements of such a set, the output container is also a set, and so removes any duplicates of the tranformed output.

A very brief REPL session to illustrate the issue:

scala> case class Person(name: String, age: Int)
defined class Person

scala> val students = Set(Person("Alice", 18), Person("Bob", 18), Person("Charles", 19))
students: scala.collection.immutable.Set[Person] = Set(Person(Alice,18), Person(Bob,18), Person(Charles,19))

scala> val totalAge = (students map (_.age)).sum
totalAge: Int = 37

I would of course expect the total age to be 18 + 18 + 19 = 55, but because the students were stored in a Set, so were their ages after the mapping, hence one of the 18s disappeared before the ages were summed.

In real code this is often more insidious and harder to spot, especially if you write utility code which simply takes a Traversable and/or use the output of methods which are declared to return a Traversable (the implementation of which happens to be a Set). It seems to me that these situations are almost impossible to spot reliably, until/unless they manifest as a bug.

So, are there any best practices which will reduce my exposure to this issue? Am I wrong to think about map-ping over a general Traversable as conceptually transforming each element in place, as opposed to adding the transformed elements in turn into some new collection? Should I call .toStream on everything before mapping, if I want to keep this mental model?

Any tips/recommendations would be greatly appreciated.

Update: Most of the answers so far have focused on the mechanics of including the duplicates in the sum. I'm more interested in the practices involved when writing code in the general case - have you drilled yourself to always call toList on every collection before calling map? Do you fastidiously check the concrete classes of all the collections in your app before calling methods on them? Etc.

Fixing up something that's already been identified as a problem is trivial - the hard part is preventing these errors from creeping in in the first place.

like image 496
Andrzej Doyle Avatar asked Apr 03 '12 11:04

Andrzej Doyle


4 Answers

You might wish to use the scalaz foldMap for this purpose, as it works on anything for which there is a Foldable typeclass available. The usage in your case will look like this:

persons foldMap (_.age)

The signature of foldMap is as follows:

trait MA[M[_], A] {
  val value: M[A]

  def foldMap[B](f: A => B)(implicit f: Foldable[M], m: Monoid[B])
}

So; as long as you have some collection CC[A] where CC can be folded over (i.e. traversed), a function from A => B where B is a monoid, you can accumulate a result.

like image 159
oxbow_lakes Avatar answered Nov 10 '22 19:11

oxbow_lakes


As not to drag extra dependencies into the picture:

(0 /: students) { case (sum, s) => sum + s.age }
like image 38
Viktor Klang Avatar answered Nov 10 '22 20:11

Viktor Klang


You can breakOut the collection type

scala> import collection.breakOut
import collection.breakOut

scala> val ages = students.map(_.age)(breakOut): List[Int]
ages: List[Int] = List(18, 18, 19)

Then you can sum as expected

Based on the update to the question, the best practice to prevent these types of errors is good unit test coverage with representative data, together with sensible API's coupled with knowledge of how the scala compiler maintains source types through map/for generators etc. If you're returning a Set of something you should make that obvious, as returning a Collection/Traversable is hiding a relevant implementation detail.

like image 3
Jon Freedman Avatar answered Nov 10 '22 20:11

Jon Freedman


You might like to use the toIterable or toList methods to convert the set to another datastructure first. http://www.scala-lang.org/api/current/scala/collection/immutable/Set.html

(Note that toIterable may return any Iterable, although the reference implementation will not, according to the linked documentation. @Debilski informs me in the comments that it nevertheless returns a Set.)

like image 2
Marcin Avatar answered Nov 10 '22 20:11

Marcin