
How to aggregate values into collection after groupBy?

I have a dataframe with the following schema:

[visitorId: string, trackingIds: array<string>, emailIds: array<string>] 

I'm looking for a way to group (or maybe rollup?) this dataframe by visitorId so that the trackingIds and emailIds columns are appended together. For example, if my initial df looks like:

visitorId |trackingIds   |emailIds
----------+--------------+--------
a158      |[666b]        |[12]
7g21      |[c0b5]        |[45]
7g21      |[c0b4]        |[87]
a158      |[666b, 777c]  |[]

I would like my output df to look like this

visitorId |trackingIds         |emailIds
----------+--------------------+---------
a158      |[666b, 666b, 777c]  |[12, '']
7g21      |[c0b5, c0b4]        |[45, 87]

I've been attempting to use the groupBy and agg operators, but haven't had much luck.

Eric Patterson asked Dec 10 '15 13:12


People also ask

What does collect_set do?

In summary, the Spark SQL functions collect_list() and collect_set() aggregate data into a list and return an ArrayType. collect_set() de-dupes the data and returns only unique values, whereas collect_list() returns the values as-is, without eliminating duplicates.
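A minimal sketch illustrating the difference (the column names are hypothetical, and spark.implicits._ is assumed to be in scope, as in spark-shell):

import org.apache.spark.sql.functions.{collect_list, collect_set}

// Hypothetical input with duplicate tracking ids per visitor.
val events = Seq(
  ("a158", "666b"), ("a158", "666b"), ("a158", "777c")).toDF("visitorId", "trackingId")

events.groupBy($"visitorId").agg(
  collect_list($"trackingId").alias("asList"),  // [666b, 666b, 777c] -- duplicates kept
  collect_set($"trackingId").alias("asSet")     // [666b, 777c]       -- duplicates removed
).show(false)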

Does collect_list maintain order?

Does that mean collect_list() also maintains the order? In your code, you sort the entire dataset before collect_list(), so yes.
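A minimal sketch of that pattern (the ts column is hypothetical); note that relying on the sorted order surviving the shuffle into collect_list() is an assumption rather than a documented guarantee:

import org.apache.spark.sql.functions.collect_list

// Hypothetical input with an explicit ordering column.
val clicks = Seq(("a158", 2L, "y"), ("a158", 1L, "x")).toDF("visitorId", "ts", "value")

clicks
  .orderBy($"ts")                // sort the whole dataset first
  .groupBy($"visitorId")
  .agg(collect_list($"value"))   // values typically come out in ts order
  .show(false)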

What is flattening in Spark?

Flatten – creates a single array from an array of arrays (a nested array). If the structure of nested arrays is deeper than two levels, only one level of nesting is removed.
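A minimal sketch (Spark >= 2.4; the column names are hypothetical):

import org.apache.spark.sql.functions.flatten

// Hypothetical nested-array column.
val nested = Seq(("id1", Seq(Seq("a", "b"), Seq("c")))).toDF("id", "xs")

nested.select(flatten($"xs").alias("flat")).show(false)
// +---------+
// |flat     |
// +---------+
// |[a, b, c]|
// +---------+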

How does groupBy work in Spark?

The groupBy method is defined in the Dataset class. groupBy returns a RelationalGroupedDataset object, which is where the agg() method is defined. Spark makes great use of object-oriented programming! The RelationalGroupedDataset class also defines a sum() method that can be used to get the same result with less code.
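A minimal sketch of that flow (the sales dataframe is hypothetical):

import org.apache.spark.sql.functions.sum

// Hypothetical input illustrating groupBy -> RelationalGroupedDataset -> agg()/sum().
val sales = Seq(("a158", 10), ("a158", 5), ("7g21", 3)).toDF("visitorId", "amount")

val grouped = sales.groupBy($"visitorId")  // RelationalGroupedDataset

grouped.agg(sum($"amount")).show()         // via agg()
grouped.sum("amount").show()               // same result with less code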


1 Answer

Spark >= 2.4

You can replace the flatten udf with the built-in flatten function:

import org.apache.spark.sql.functions.flatten 

leaving the rest as-is.
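For example, a minimal sketch of the Spark >= 2.4 version, reusing the dfWithPlaceholders defined in the section below:

import org.apache.spark.sql.functions.{collect_list, flatten}

// Same aggregation as in the 2.0-2.3 version, but with the built-in flatten instead of the udf.
dfWithPlaceholders
  .groupBy($"visitorId")
  .agg(
    flatten(collect_list($"trackingIds")).alias("trackingIds"),
    flatten(collect_list($"emailIds")).alias("emailIds"))
  .show(false)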

Spark >= 2.0, < 2.4

It is possible, but quite expensive. Using the data you've provided:

case class Record(
  visitorId: String, trackingIds: Array[String], emailIds: Array[String])

val df = Seq(
  Record("a158", Array("666b"), Array("12")),
  Record("7g21", Array("c0b5"), Array("45")),
  Record("7g21", Array("c0b4"), Array("87")),
  Record("a158", Array("666b", "777c"), Array.empty[String])).toDF

and a helper function:

import org.apache.spark.sql.functions.udf

val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten)

we can fill the blanks with placeholders:

import org.apache.spark.sql.functions.{array, lit, size, when}

val dfWithPlaceholders = df.withColumn(
  "emailIds",
  when(size($"emailIds") === 0, array(lit(""))).otherwise($"emailIds"))

collect_list and flatten:

import org.apache.spark.sql.functions.collect_list

val emailIds = flatten(collect_list($"emailIds")).alias("emailIds")
val trackingIds = flatten(collect_list($"trackingIds")).alias("trackingIds")

dfWithPlaceholders
  .groupBy($"visitorId")
  .agg(trackingIds, emailIds)

// +---------+------------------+--------+
// |visitorId|       trackingIds|emailIds|
// +---------+------------------+--------+
// |     a158|[666b, 666b, 777c]|  [12, ]|
// |     7g21|      [c0b5, c0b4]|[45, 87]|
// +---------+------------------+--------+

With statically typed Dataset:

dfWithPlaceholders.as[Record]
  .groupByKey(_.visitorId)
  .mapGroups { case (key, vs) =>
    vs.map(v => (v.trackingIds, v.emailIds)).toArray.unzip match {
      case (trackingIds, emailIds) =>
        Record(key, trackingIds.flatten, emailIds.flatten)
    }
  }

// +---------+------------------+--------+
// |visitorId|       trackingIds|emailIds|
// +---------+------------------+--------+
// |     a158|[666b, 666b, 777c]|  [12, ]|
// |     7g21|      [c0b5, c0b4]|[45, 87]|
// +---------+------------------+--------+

Spark 1.x

You can convert to an RDD and group:

import org.apache.spark.sql.Row

dfWithPlaceholders.rdd
  .map {
    case Row(
      id: String,
      trcks: Seq[String @unchecked],
      emails: Seq[String @unchecked]) => (id, (trcks, emails))
  }
  .groupByKey
  .map { case (key, vs) => vs.toArray.unzip match {
    case (trackingIds, emailIds) =>
      Record(key, trackingIds.flatten, emailIds.flatten)
  }}
  .toDF

// +---------+------------------+--------+
// |visitorId|       trackingIds|emailIds|
// +---------+------------------+--------+
// |     7g21|      [c0b5, c0b4]|[45, 87]|
// |     a158|[666b, 666b, 777c]|  [12, ]|
// +---------+------------------+--------+
zero323 answered Sep 26 '22 06:09