This question is quite relevant, but is 2 years old: In memory OLAP engine in Java
I would like to create a pivot-table like matrix from a given tabular dataset, in memory
e.g. an age by marital status count (rows are age, columns are marital status).
The input: List of People, with age and some Boolean property (e.g. married),
The desired output: count of People, by age (row) and isMarried (column)
case class Person(val age:Int, val isMarried:Boolean)
...
val people:List[Person] = ... //
val peopleByAge = people.groupBy(_.age) //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status
I managed to do it the naive way, first grouping by age, then map
which is doing a count
by marital status, and outputs the result, then I foldRight
to aggregate
TreeMap(peopleByAge.toSeq: _*).map(x => {
val age = x._1
val rows = x._2
val numMarried = rows.count(_.isMarried())
val numNotMarried = rows.length - numMarried
(age, numMarried, numNotMarried)
}).foldRight(List[FinalResult]())(row,list) => {
val cumMarried = row._2+
(if (list.isEmpty) 0 else list.last.cumMarried)
val cumNotMarried = row._3 +
(if (list.isEmpty) 0 else l.last.cumNotMarried)
list :+ new FinalResult(row._1, row._2, row._3, cumMarried,cumNotMarried)
}.reverse
I don't like the above code, it's not efficient, hard to read, and I'm sure there is a better way.
How do I groupBy "both"? and how do I do a count for each subgroup, e.g.
How many people are exactly 30 years old and married?
Another question, is how do I do a running total, to answer the question:
How many people above 30 are married?
Edit:
Thank you for all the great answers.
just to clarify, I would like the output to include a "table" with the following columns
Not only answering those specific queries, but to produce a report that will allow answering all such type of questions.
XLCubed lets users add standard Excel formulae into cube connected grids (like a pivot table without the restrictions). Users can simply add a new column or row and type any Excel formula, including Vlookups.
In the PivotTable, right-click the value field you want to change, and then click Summarize Values By. Click the summary function you want. Note: Summary functions aren't available in PivotTables that are based on Online Analytical Processing (OLAP) source data. The sum of the values.
A pivot table is a statistics tool that summarizes and reorganizes selected columns and rows of data in a spreadsheet or database table to obtain a desired report. The tool does not actually change the spreadsheet or database itself, it simply “pivots” or turns the data to view it from different perspectives.
Here is an option that is a little more verbose, but does this in a generic fashion instead of using strict data types. You could of course use generics to make this nicer, but i think you get the idea.
/** Creates a new pivot structure by finding correlated values
* and performing an operation on these values
*
* @param accuOp the accumulator function (e.g. sum, max, etc)
* @param xCol the "x" axis column
* @param yCol the "y" axis column
* @param accuCol the column to collect and perform accuOp on
* @return a new Pivot instance that has been transformed with the accuOp function
*/
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
// create list of indexes that correlate to x, y, accuCol
val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))
// group by x and y, sending the resulting collection of
// accumulated values to the accuOp function for post-processing
val data = body.groupBy(row => {
(row(colsIdx(0)), row(colsIdx(1)))
}).map(g => {
(g._1, accuOp(g._2.map(_(colsIdx(2)))))
}).toMap
// get distinct axis values
val xAxis = data.map(g => {g._1._1}).toList.distinct
val yAxis = data.map(g => {g._1._2}).toList.distinct
// create result matrix
val newRows = yAxis.map(y => {
xAxis.map(x => {
data.getOrElse((x,y), "")
})
})
// collect it with axis labels for results
Pivot(List((yCol + "/" + xCol) +: xAxis) :::
newRows.zip(yAxis).map(x=> {x._2 +: x._1}))
}
my Pivot type is pretty basic:
class Pivot(val rows: List[List[String]]) {
val headers = rows.head.zipWithIndex.toMap
val body = rows.tail
...
}
And to test it, you could do something like this:
val marriedP = Pivot(
List(
List("Name", "Age", "Married"),
List("Bill", "42", "TRUE"),
List("Heloise", "47", "TRUE"),
List("Thelma", "34", "FALSE"),
List("Bridget", "47", "TRUE"),
List("Robert", "42", "FALSE"),
List("Eddie", "42", "TRUE")
)
)
def accum(values: List[String]) = {
values.map(x => {1}).sum.toString
}
println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))
Which yields:
Name Age Married
Bill 42 TRUE
Heloise 47 TRUE
Thelma 34 FALSE
Bridget 47 TRUE
Robert 42 FALSE
Eddie 42 TRUE
Married/Age 47 42 34
TRUE 2 2
FALSE 1 1
The nice thing is that you can use currying to pass in any function for the values like you would in a traditional excel pivot table.
More can be found here: https://github.com/vinsonizer/pivotfun
You can
val groups = people.groupBy(p => (p.age, p.isMarried))
and then
val thirty_and_married = groups((30, true))._2
val over_thirty_and_married_count =
groups.filterKeys(k => k._1 > 30 && k._2).map(_._2.length).sum
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With