Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count instances of combination of columns in spark dataframe using scala

I have a spark data frame in scala called df with two columns, say a and b. Column a contains letters and column b contains numbers giving the below.

   a   b
----------
   g   0
   f   0
   g   0
   f   1

I can get the distinct rows using

val dfDistinct=df.select("a","b").distinct

which gives the following:

   a  b
----------
   g   0
   f   0
   f   1

I want to add another column with the number of times these distinct combinations occurs in the first dataframe so I'd end up with

a  b  count
  ----------
  g  0   2
  f  0   1
  f  1   1

I don't mind if that modifies the original command or I have a separate operation on dfDistinct giving another data frame.

Any advice greatly appreciated and I apologise for the trivial nature of this question but I'm not the most experienced with this kind of operation in scala or spark.

Thanks

Dean

like image 960
Dean Avatar asked Oct 28 '15 14:10

Dean


2 Answers

You can simply aggregate and count:

df.groupBy($"a", $"b").count

or a little bit more verbose:

import org.apache.spark.sql.functions.{count, lit}

df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))

Both are equivalent to a raw SQL aggregation:

df.registerTempTable("df")

sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
like image 126
zero323 Avatar answered Sep 29 '22 22:09

zero323


Also see Cross Tabulation

val g="g"
val f = "f"
val df = Seq(
  (g, "0"),
  (f, "0"),
  (g, "0"),
  (f, "1")
).toDF("a", "b")
val res = df.stat.crosstab("a","b")
res.show

+---+---+---+
|a_b|  0|  1|
+---+---+---+
|  g|  2|  0|
|  f|  1|  1|
like image 30
oluies Avatar answered Sep 29 '22 21:09

oluies