Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating a frequency map for a string in Scala

Tags:

string

map

scala

Let's say I have a string, "hello", and I want to generate a character frequency map:

Map[Char,Int] = Map(h -> 1, e -> 1, o -> 1, l -> 2) 

I could do this iteratively:

val str = "hello" var counts = new scala.collection.mutable.HashMap[Char,Int] for (i <- str) {     if (counts.contains(i))         counts.put(i, counts(i) + 1)     else         counts.put(i, 1) } 

By messing around in the REPL, I've found I can do something a bit more concise and not using a mutable collection:

> str.groupBy(_.toChar).map{ p => (p._1, p._2.length)} scala.collection.immutable.Map[Char,Int] = Map(h -> 1, e -> 1, o -> 1, l -> 2) 

But I don't know about the performance characteristics of groupBy() nor what is going on in the block passed to map (like what, exactly, p is).

How do I do this idiomatically using the functional paradigms in Scala?


For background, I'm just coming to Scala for the first time from Ruby. In Ruby, I would use inject but I'm not sure what the parallel way to do it in Scala is:

counts = str.each_byte.inject(Hash.new(0)){ |h, c| h[c] += 1; h} 
like image 704
nohat Avatar asked Aug 24 '12 07:08

nohat


1 Answers

1) What does p mean?

groupBy takes a function which maps an elements to a key of type K. When invoked on some collection Coll, it returns a Map[K, Coll] which contains mappings from keys K to all the elements which mapped to the same key.

So, in your case, str.groupBy(_.toChar) yields a map mapping from a key k (which is a character) to a string with all the elements (characters) c such that k == c.toChar. You get this:

Map(e -> "e", h -> "h", l -> "ll", o -> "o") 

A Map is an iterable of pairs of keys and values. In this case, each pair is a character and a string of elements. Calling the map operation on a Map involves mapping on these pairs - p is a pair where p._1 is a character, and p._2 is the associated string (on which you can call length, as you did above).

2) How to do this idiomatically

The above is how to do it idiomatically - using groupBy and map. Alternatively, you can use an immutable map and recursion on the string length to compute the frequencies, or an immutable map and a foldLeft.

3) Performance characteristic

Best to benchmark to see the differences. Here are a couple of microbenchmark for a highly-repetitive string (~3GHz iMac, JDK7, Scala 2.10.0 nightly):

object Imperative extends testing.Benchmark {   val str = "abc" * 750000    def run() {     var counts = new scala.collection.mutable.HashMap[Char,Int]     var i = 0     val until = str.length     while (i < until) {       var c = str(i)       if (counts.contains(c))         counts.put(c, counts(c) + 1)       else         counts.put(c, 1)       i += 1     }      //println(f)   } }   object Combinators extends testing.Benchmark {   val str = "abc" * 750000    def run() {     val f = str.groupBy(_.toChar).map(p => (p._1, p._2.length))   } }   object Fold extends testing.Benchmark {   val str = "abc" * 750000    def run() {     val f = str.foldLeft(Map[Char, Int]() withDefaultValue 0){(h, c) => h.updated(c, h(c)+1)}   } } 

Results:

  • Imperative: $ 103 57 53 58 53 53 53 53 53 53

  • Combinators: $ 72 51 63 56 53 52 52 54 53 53

  • Fold: $ 163 62 71 62 57 57 57 58 57 57

Note that changing the imperative version to use withDefaultValue:

var counts = new scala.collection.mutable.HashMap[Char,Int].withDefaultValue(0) var i = 0 val until = str.length while (i < until) {   var c = str(i)   counts.put(c, counts(c) + 1)   i += 1 } 

is apparently terribly slow due to forwarding each put call:

  • withDefaultValue: $ 133 87 109 106 101 100 101 100 101 101

Conclusion: the boxing and unboxing of characters in this case is high-enough so that the differences in performance between these approaches are hard to observe.

EDIT:

Update: You may want to use ScalaMeter inline benchmarking in place of the Benchmark trait.

like image 198
axel22 Avatar answered Oct 11 '22 12:10

axel22