I have a Spark DataFrame in Scala called df with two columns, say a and b. Column a contains letters and column b contains numbers, giving the below:
a b
----------
g 0
f 0
g 0
f 1
I can get the distinct rows using
val dfDistinct = df.select("a", "b").distinct
which gives the following:
a b
----------
g 0
f 0
f 1
I want to add another column with the number of times each of these distinct combinations occurs in the first DataFrame, so I'd end up with:
a b count
----------
g 0 2
f 0 1
f 1 1
I don't mind whether that modifies the original command or whether it's a separate operation on dfDistinct giving another DataFrame.
Any advice would be greatly appreciated. I apologise for the trivial nature of this question, but I'm not the most experienced with this kind of operation in Scala or Spark.
Thanks
Dean
You can simply aggregate and count:
df.groupBy($"a", $"b").count
or, a little more verbosely:
import org.apache.spark.sql.functions.{count, lit}
df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))
Both are equivalent to a raw SQL aggregation:
df.registerTempTable("df")
sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
Also see Cross Tabulation:
val g="g"
val f = "f"
val df = Seq(
(g, "0"),
(f, "0"),
(g, "0"),
(f, "1")
).toDF("a", "b")
val res = df.stat.crosstab("a","b")
res.show
+---+---+---+
|a_b|  0|  1|
+---+---+---+
|  g|  2|  0|
|  f|  1|  1|
+---+---+---+
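Note that crosstab gives you a wide contingency table (one row per value of a, one column per value of b, with zeros for combinations that never occur) rather than the long a/b/count layout asked for, so groupBy(...).count is the more direct answer to the original question.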