Is there a data structure / library to do in memory olap / pivot tables in Java / Scala?

Relevant questions

This question is quite relevant, but is 2 years old: In memory OLAP engine in Java

Background

I would like to create a pivot-table-like matrix from a given tabular dataset, in memory,

e.g. a count by age and marital status (rows are ages, columns are marital status).

The input: a List of People, each with an age and some Boolean property (e.g. married).
The desired output: a count of People by age (row) and isMarried (column).

What I've tried (Scala)

case class Person(age: Int, isMarried: Boolean)

...
val people: List[Person] = ...

val peopleByAge = people.groupBy(_.age)  //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status

I managed to do it the naive way: first grouping by age, then mapping each group to its counts by marital status, and finally using foldRight to accumulate the running totals.

import scala.collection.immutable.TreeMap

// Sort the groups by age, count married / not married per age,
// then fold from the highest age down to build the running totals.
TreeMap(peopleByAge.toSeq: _*).map(x => {
    val age = x._1
    val rows = x._2
    val numMarried = rows.count(_.isMarried)
    val numNotMarried = rows.length - numMarried
    (age, numMarried, numNotMarried)
}).foldRight(List[FinalResult]())((row, list) => {
     val cumMarried = row._2 +
        (if (list.isEmpty) 0 else list.last.cumMarried)
     val cumNotMarried = row._3 +
        (if (list.isEmpty) 0 else list.last.cumNotMarried)
     list :+ new FinalResult(row._1, row._2, row._3, cumMarried, cumNotMarried)
}).reverse

I don't like the above code: it's inefficient and hard to read, and I'm sure there is a better way.

The question(s)

How do I groupBy "both" properties at once, and how do I count each subgroup? E.g.

How many people are exactly 30 years old and married?

A second question is how to compute a running total, to answer a question such as:

How many people above 30 are married?

Edit:

Thank you for all the great answers.

Just to clarify, I would like the output to be a "table" with the following columns:

Age (ascending)
Num Married
Num Not Married
Running Total Married
Running Total Not Married

The goal is not only to answer those specific queries, but to produce a report that can answer all questions of this type.


asked Oct 19 '12 18:10

Eran Medan

2 Answers

Here is an option that is a little more verbose, but works generically on string columns instead of using strict data types. You could of course use generics to make this nicer, but I think you get the idea.

/** Creates a new pivot structure by finding correlated values 
  * and performing an operation on these values
  *
  * @param accuOp the accumulator function (e.g. sum, max, etc)
  * @param xCol the "x" axis column
  * @param yCol the "y" axis column
  * @param accuCol the column to collect and perform accuOp on
  * @return a new Pivot instance that has been transformed with the accuOp function
  */
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
  // create list of indexes that correlate to x, y, accuCol
  val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))

  // group by x and y, sending the resulting collection of
  // accumulated values to the accuOp function for post-processing
  val data = body.groupBy(row => {
    (row(colsIdx(0)), row(colsIdx(1)))
  }).map(g => {
    (g._1, accuOp(g._2.map(_(colsIdx(2)))))
  }).toMap

  // get distinct axis values
  val xAxis = data.map(g => {g._1._1}).toList.distinct
  val yAxis = data.map(g => {g._1._2}).toList.distinct

  // create result matrix
  val newRows = yAxis.map(y => {
    xAxis.map(x => {
      data.getOrElse((x,y), "")
    })
  })

 // collect it with axis labels for results
 Pivot(List((yCol + "/" + xCol) +: xAxis) :::
   newRows.zip(yAxis).map(x=> {x._2 +: x._1}))
}

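My Pivot type is pretty basic (doPivot above is a method on it). It is a case class so that Pivot(...) can be called without new:

case class Pivot(rows: List[List[String]]) {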

  val headers = rows.head.zipWithIndex.toMap
  val body    = rows.tail
  ...
}

And to test it, you could do something like this:

val marriedP = Pivot(
  List(
    List("Name", "Age", "Married"),
    List("Bill", "42", "TRUE"),
    List("Heloise", "47", "TRUE"),
    List("Thelma", "34", "FALSE"),
    List("Bridget", "47", "TRUE"),
    List("Robert", "42", "FALSE"),
    List("Eddie", "42", "TRUE")

  )
)

// count accumulator: the number of values collected for a cell
def accum(values: List[String]) = {
    values.length.toString
}
println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))

Which yields:

Name     Age      Married  
Bill     42       TRUE     
Heloise  47       TRUE     
Thelma   34       FALSE    
Bridget  47       TRUE     
Robert   42       FALSE    
Eddie    42       TRUE     

Married/Age  47           42           34           
TRUE         2            2                         
FALSE                     1            1

The nice thing is that, because the accumulator is passed in its own (curried) parameter list, you can plug in any aggregation function for the values, as you would in a traditional Excel pivot table.

More can be found here: https://github.com/vinsonizer/pivotfun
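As a rough illustration of the "use generics" remark above, here is a minimal sketch of a typed variant. It is not taken from the linked repository: the names GenericPivot, pivot and sample are made up for this example, and the aggregator is moved to the last parameter list so that Scala can infer its argument type from the earlier lists.

```scala
object GenericPivot {
  // Hypothetical generic counterpart of doPivot: key extractors replace
  // column names, and the aggregator sits in its own parameter list so
  // that any function over the collected values can be plugged in.
  def pivot[R, X, Y, V, A](rows: Seq[R])(x: R => X, y: R => Y, v: R => V)(
      agg: Seq[V] => A): Map[(X, Y), A] =
    rows
      .groupBy(r => (x(r), y(r)))                      // one group per (x, y) cell
      .map { case (cell, rs) => cell -> agg(rs.map(v)) } // aggregate each cell

  // Same sample data as in the test above, as (name, age, married) tuples.
  val sample: List[(String, Int, Boolean)] = List(
    ("Bill", 42, true), ("Heloise", 47, true), ("Thelma", 34, false),
    ("Bridget", 47, true), ("Robert", 42, false), ("Eddie", 42, true))

  // Count per (age, married) cell.
  val counts: Map[(Int, Boolean), Int] =
    pivot(sample)(_._2, _._3, _._1)(_.size)

  // A different aggregator over the same cells: sorted names, comma-separated.
  val names: Map[(Int, Boolean), String] =
    pivot(sample)(_._2, _._3, _._1)(_.sorted.mkString(","))
}
```

Only the last argument list changes between counts and names, which is the same "swap the accumulator" idea as in doPivot. Cells with no data are simply absent from the resulting Map.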


answered Nov 08 '22 22:11

Jason V

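You can group by a tuple of both properties:

val groups = people.groupBy(p => (p.age, p.isMarried))

and then

// getOrElse avoids an exception when nobody is exactly 30 and married
val thirty_and_married = groups.getOrElse((30, true), Nil).length
val over_thirty_and_married_count = 
  groups.filterKeys(k => k._1 > 30 && k._2).map(_._2.length).sum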
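To get the full table requested in the question's edit, the same two-key groupBy can be combined with scanLeft for the running totals. The following is only a sketch under assumptions: the Row type and its field names are invented for illustration, and the running totals are assumed to accumulate from the youngest age upward.

```scala
object PivotReport {
  final case class Person(age: Int, isMarried: Boolean)

  // One report row; the field names are illustrative, not from the question.
  final case class Row(age: Int, married: Int, notMarried: Int,
                       runMarried: Int, runNotMarried: Int)

  def report(people: Seq[Person]): List[Row] = {
    // Count people per (age, isMarried) cell.
    val counts: Map[(Int, Boolean), Int] =
      people.groupBy(p => (p.age, p.isMarried)).map { case (k, ps) => k -> ps.size }

    // Distinct ages in ascending order.
    val ages = counts.keys.map(_._1).toList.distinct.sorted

    // Per-age counts; a missing cell counts as 0.
    val perAge = ages.map { a =>
      (a, counts.getOrElse((a, true), 0), counts.getOrElse((a, false), 0))
    }

    // scanLeft carries the running totals; tail drops the (0, 0) seed.
    val running = perAge
      .scanLeft((0, 0)) { case ((m, n), (_, dm, dn)) => (m + dm, n + dn) }
      .tail

    perAge.zip(running).map { case ((a, m, n), (rm, rn)) => Row(a, m, n, rm, rn) }
  }
}
```

With totals accumulated this way, "how many people above 30 are married" is the last row's runMarried minus the runMarried value at age 30. If totals counted from the oldest age downward are preferred, scanRight with init in place of scanLeft with tail would be the analogous choice.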

answered Nov 08 '22 22:11

Rex Kerr


Tags:

data-structures

scala

olap