
Is there a data structure / library to do in-memory OLAP / pivot tables in Java / Scala?

Relevant questions

This question is quite relevant, but is 2 years old: In memory OLAP engine in Java

Background

I would like to create a pivot-table-like matrix from a given tabular dataset, in memory, e.g. an age-by-marital-status count (rows are age, columns are marital status).

  • The input: List of People, with age and some Boolean property (e.g. married),

  • The desired output: count of People, by age (row) and isMarried (column)

What I've tried (Scala)

case class Person(val age:Int, val isMarried:Boolean)

...
val people:List[Person] = ... //

val peopleByAge = people.groupBy(_.age)  //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status

I managed to do it the naive way: first group by age, then map each age group to its counts by marital status, then foldRight over the result to accumulate the running totals.

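import scala.collection.immutable.TreeMap

// Assumed shape of FinalResult (its definition is not shown in the post)
case class FinalResult(age: Int, numMarried: Int, numNotMarried: Int,
                       cumMarried: Int, cumNotMarried: Int)

TreeMap(peopleByAge.toSeq: _*).map { case (age, rows) =>
    val numMarried = rows.count(_.isMarried)
    val numNotMarried = rows.length - numMarried
    (age, numMarried, numNotMarried)
}.foldRight(List[FinalResult]()) { (row, list) =>
     // foldRight starts at the highest age, so totals accumulate downward
     val cumMarried = row._2 +
        (if (list.isEmpty) 0 else list.last.cumMarried)
     val cumNotMarried = row._3 +
        (if (list.isEmpty) 0 else list.last.cumNotMarried)
     list :+ FinalResult(row._1, row._2, row._3, cumMarried, cumNotMarried)
}.reverse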

I don't like the above code: it's inefficient and hard to read, and I'm sure there is a better way.

The question(s)

How do I groupBy "both" fields at once? And how do I do a count for each subgroup, e.g.

How many people are exactly 30 years old and married?

Another question is how to do a running total, to answer questions like:

How many people above 30 are married?


Edit:

Thank you for all the great answers.

Just to clarify, I would like the output to include a "table" with the following columns:

  • Age (ascending)
  • Num Married
  • Num Not Married
  • Running Total Married
  • Running Total Not Married

The goal is not only to answer those specific queries, but to produce a report that can answer all questions of this type.

Eran Medan asked Oct 19 '12 18:10


2 Answers

Here is an option that is a little more verbose, but does this in a generic fashion instead of using strict data types. You could of course use generics to make this nicer, but I think you get the idea.

/** Creates a new pivot structure by finding correlated values 
  * and performing an operation on these values
  *
  * @param accuOp the accumulator function (e.g. sum, max, etc)
  * @param xCol the "x" axis column
  * @param yCol the "y" axis column
  * @param accuCol the column to collect and perform accuOp on
  * @return a new Pivot instance that has been transformed with the accuOp function
  */
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
  // create list of indexes that correlate to x, y, accuCol
  val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))

  // group by x and y, sending the resulting collection of
  // accumulated values to the accuOp function for post-processing
  val data = body.groupBy(row => {
    (row(colsIdx(0)), row(colsIdx(1)))
  }).map(g => {
    (g._1, accuOp(g._2.map(_(colsIdx(2)))))
  }).toMap

  // get distinct axis values
  val xAxis = data.map(g => {g._1._1}).toList.distinct
  val yAxis = data.map(g => {g._1._2}).toList.distinct

  // create result matrix
  val newRows = yAxis.map(y => {
    xAxis.map(x => {
      data.getOrElse((x,y), "")
    })
  })

  // collect it with axis labels for results
  Pivot(List((yCol + "/" + xCol) +: xAxis) :::
    newRows.zip(yAxis).map(x => {x._2 +: x._1}))
}

My Pivot type is pretty basic:

// a case class, so that Pivot(...) can be called without `new`
case class Pivot(rows: List[List[String]]) {

  val headers = rows.head.zipWithIndex.toMap
  val body    = rows.tail
  ...
}

And to test it, you could do something like this:

val marriedP = Pivot(
  List(
    List("Name", "Age", "Married"),
    List("Bill", "42", "TRUE"),
    List("Heloise", "47", "TRUE"),
    List("Thelma", "34", "FALSE"),
    List("Bridget", "47", "TRUE"),
    List("Robert", "42", "FALSE"),
    List("Eddie", "42", "TRUE")

  )
)

def accum(values: List[String]) = {
    values.map(x => {1}).sum.toString
}
println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))

Which yields:

Name     Age      Married  
Bill     42       TRUE     
Heloise  47       TRUE     
Thelma   34       FALSE    
Bridget  47       TRUE     
Robert   42       FALSE    
Eddie    42       TRUE     

Married/Age  47           42           34           
TRUE         2            2                         
FALSE                     1            1 

The nice thing is that you can use currying to pass in any function for the values, like you would in a traditional Excel pivot table.
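To illustrate the currying point, here is a minimal, self-contained sketch. It is a simplified stand-in for doPivot, not the code from the repository: the names `pivot`, `Cell`, `Table`, `countBy` and `maxBy` are hypothetical, and the input is a list of (x, y, value) triples rather than a header-indexed table.

```scala
// Hypothetical aliases: one input cell and the resulting pivot.
type Cell  = (String, String, String)        // (x, y, value)
type Table = Map[(String, String), String]   // (x, y) -> aggregated value

// The first parameter list fixes the aggregation,
// the second one supplies the data.
def pivot(accuOp: List[String] => String)(cells: List[Cell]): Table =
  cells.groupBy(c => (c._1, c._2)).map { case (key, group) =>
    key -> accuOp(group.map(_._3))
  }

// Supplying only the first argument list yields a reusable function.
val countBy: List[Cell] => Table = pivot(values => values.length.toString)
val maxBy: List[Cell] => Table   = pivot(values => values.map(_.toInt).max.toString)

// Made-up sample data; the third element is an arbitrary numeric value.
val cells = List(("42", "TRUE", "3"), ("42", "TRUE", "7"), ("47", "FALSE", "5"))

println(countBy(cells)(("42", "TRUE")))  // 2
println(maxBy(cells)(("42", "TRUE")))    // 7
```

Swapping the aggregation only means passing a different function to the first parameter list; the grouping logic stays untouched.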

More can be found here: https://github.com/vinsonizer/pivotfun

Jason V answered Nov 08 '22 22:11


You can group on both fields at once by using a tuple as the key:

val groups = people.groupBy(p => (p.age, p.isMarried))

and then

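// groups maps (age, isMarried) to the matching people; a missing key means zero
val thirty_and_married = groups.getOrElse((30, true), Nil).length
val over_thirty_and_married_count =
  groups.filter { case ((age, married), _) => age > 30 && married }
    .values.map(_.length).sum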
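
The same tuple grouping can be extended to the full report described in the question's edit. The following is only a sketch under assumptions: `ReportRow` and `report` are hypothetical names, and the running totals accumulate in ascending age order.

```scala
// Person mirrors the case class from the question.
case class Person(age: Int, isMarried: Boolean)

// ReportRow is a hypothetical name for one line of the report.
case class ReportRow(age: Int, married: Int, notMarried: Int,
                     runningMarried: Int, runningNotMarried: Int)

def report(people: List[Person]): List[ReportRow] = {
  // Count people per (age, isMarried) cell.
  val counts: Map[(Int, Boolean), Int] =
    people.groupBy(p => (p.age, p.isMarried)).map { case (k, v) => k -> v.size }

  val ages = people.map(_.age).distinct.sorted

  // scanLeft threads the running totals through the ages in ascending
  // order; the seed row is dropped with .tail.
  ages.scanLeft(ReportRow(0, 0, 0, 0, 0)) { (prev, age) =>
    val m = counts.getOrElse((age, true), 0)
    val n = counts.getOrElse((age, false), 0)
    ReportRow(age, m, n, prev.runningMarried + m, prev.runningNotMarried + n)
  }.tail
}

// Made-up sample data.
val sample = List(Person(42, true), Person(47, true), Person(34, false),
                  Person(47, true), Person(42, false), Person(42, true))

report(sample).foreach(println)
// ReportRow(34,0,1,0,1)
// ReportRow(42,2,1,2,2)
// ReportRow(47,2,0,4,2)
```

With ascending totals, "how many people above 30 are married" is the last row's runningMarried minus the runningMarried of the row for age 30 (or of the nearest lower age, if nobody is exactly 30).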

Rex Kerr answered Nov 08 '22 22:11