Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scala: groupBy (identity) of List Elements

Tags:

scala

I develop an application that builds pairs of words in (tokenised) text and produces the number of times each pair occurs (even when same-word pairs occur multiple times, it's OK as it'll be evened out later in the algorithm).

When I use

elements groupBy()

I want to group by the elements' content itself, so I wrote the following:

def self(x: (String, String)) = x

/**
 * Maps a collection of words to a map where key is a pair of words and the 
 *  value is number of
 * times this pair
 * occurs in the passed array
 */
def producePairs(words: Array[String]): Map[(String,String), Double] = {
  var table = List[(String, String)]()
  words.foreach(w1 =>
    words.foreach(w2 =>
      table = table ::: List((w1, w2))))


  val grouppedPairs = table.groupBy(self)
  val size = int2double(grouppedPairs.size)
  return grouppedPairs.mapValues(_.length / size)
}

Now, I fully realise that this self() trick is a dirty hack. So I thought a little a came out with a:

grouppedPairs = table groupBy (x => x)

This way it produced what I want. However, I still feel that I clearly miss something and there should be easier way of doing it. Any ideas at all, dear all?

Also, if you'd help me to improve the pairs extraction part, it'll also help a lot – it looks very imperative, C++ - ish right now. Many thanks in advance!

like image 397
sgzmd Avatar asked Nov 21 '10 11:11

sgzmd


2 Answers

I'd suggest this:

def producePairs(words: Array[String]): Map[(String,String), Double] = {
    val table = for(w1 <- words; w2 <- words) yield (w1,w2)
    val grouppedPairs = table.groupBy(identity)
    val size = grouppedPairs.size.toDouble
    grouppedPairs.mapValues(_.length / size)
}

The for comprehension is much easier to read, and there is already a predifined function identity, with is a generalized version of your self.

like image 147
Landei Avatar answered Oct 13 '22 00:10

Landei


you are creating a list of pairs of all words against all words by iterating over words twice, where i guess you just want the neighbouring pairs. the easiest is to use a sliding view instead.

def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs   = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
  val grouped = pairs.groupBy(t => t)
  grouped.mapValues(_.size)
}

another approach would be to fold the list of pairs by summing them up. not sure though that this is more efficient:

def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
  pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
     m + (p -> (m.getOrElse(p, 0) + 1))
  }
}

i see you are return a relative number (Double). for simplicity i have just counted the occurances, so you need to do the final division. i think you want to divide by the number of total pairs (words.size - 1) and not by the number of unique pairs (grouped.size)..., so the relative frequencies sum up to 1.0

like image 32
0__ Avatar answered Oct 12 '22 23:10

0__