I really like functional programming concepts, but I've been bitten on two separate occasions now by the same gotcha, when mapping across a collection which happens to be a Set
(i.e. automatically removes duplicates). The issue is that after transforming the elements of such a set, the output container is also a set, and so removes any duplicates of the tranformed output.
A very brief REPL session to illustrate the issue:
scala> case class Person(name: String, age: Int)
defined class Person
scala> val students = Set(Person("Alice", 18), Person("Bob", 18), Person("Charles", 19))
students: scala.collection.immutable.Set[Person] = Set(Person(Alice,18), Person(Bob,18), Person(Charles,19))
scala> val totalAge = (students map (_.age)).sum
totalAge: Int = 37
I would of course expect the total age to be 18 + 18 + 19 = 55, but because the students were stored in a Set
, so were their ages after the mapping, hence one of the 18
s disappeared before the ages were summed.
In real code this is often more insidious and harder to spot, especially if you write utility code which simply takes a Traversable
and/or use the output of methods which are declared to return a Traversable
(the implementation of which happens to be a Set). It seems to me that these situations are almost impossible to spot reliably, until/unless they manifest as a bug.
So, are there any best practices which will reduce my exposure to this issue? Am I wrong to think about map
-ping over a general Traversable as conceptually transforming each element in place, as opposed to adding the transformed elements in turn into some new collection? Should I call .toStream
on everything before mapping, if I want to keep this mental model?
Any tips/recommendations would be greatly appreciated.
Update: Most of the answers so far have focused on the mechanics of including the duplicates in the sum. I'm more interested in the practices involved when writing code in the general case - have you drilled yourself to always call toList
on every collection before calling map
? Do you fastidiously check the concrete classes of all the collections in your app before calling methods on them? Etc.
Fixing up something that's already been identified as a problem is trivial - the hard part is preventing these errors from creeping in in the first place.
You might wish to use the scalaz foldMap
for this purpose, as it works on anything for which there is a Foldable
typeclass available. The usage in your case will look like this:
persons foldMap (_.age)
The signature of foldMap
is as follows:
trait MA[M[_], A] {
val value: M[A]
def foldMap[B](f: A => B)(implicit f: Foldable[M], m: Monoid[B])
}
So; as long as you have some collection CC[A]
where CC
can be folded over (i.e. traversed), a function from A => B
where B is a monoid, you can accumulate a result.
As not to drag extra dependencies into the picture:
(0 /: students) { case (sum, s) => sum + s.age }
You can breakOut the collection type
scala> import collection.breakOut
import collection.breakOut
scala> val ages = students.map(_.age)(breakOut): List[Int]
ages: List[Int] = List(18, 18, 19)
Then you can sum as expected
Based on the update to the question, the best practice to prevent these types of errors is good unit test coverage with representative data, together with sensible API's coupled with knowledge of how the scala compiler maintains source types through map/for generators etc. If you're returning a Set of something you should make that obvious, as returning a Collection/Traversable is hiding a relevant implementation detail.
You might like to use the toIterable
ortoList
methods to convert the set to another datastructure first. http://www.scala-lang.org/api/current/scala/collection/immutable/Set.html
(Note that toIterable
may return any Iterable, although the reference implementation will not, according to the linked documentation. @Debilski informs me in the comments that it nevertheless returns a Set.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With