I've got a DataFrame I'm operating on, and I want to group by a set of columns and operate per-group on the rest of the columns. In regular RDD-land I think it would look something like this:
rdd.map( tup => ((tup._1, tup._2, tup._3), tup) )
   .groupByKey()
   .foreachPartition( iter => doSomeJob(iter) )
In DataFrame-land I'd start like this:
df.groupBy("col1", "col2", "col3") // Reference by name
but then I'm not sure how to operate on the groups if my operations are more complicated than the mean/min/max/count offered by GroupedData.
For example, I want to build a single MongoDB document per ("col1", "col2", "col3") group (by iterating through the associated Rows in the group), scale down to N partitions, then insert the docs into a MongoDB database. The N limit is the max number of simultaneous connections I want.
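Concretely, the shape I have in mind is roughly the sketch below (hand-wavy; the document building and connection handling inside the closure are exactly the parts I don't know how to do from a DataFrame):

rdd.map( tup => ((tup._1, tup._2, tup._3), tup) )
   .groupByKey()      // one (key, Iterable[record]) pair per group
   .coalesce(N)       // at most N partitions, so at most N simultaneous connections
   .foreachPartition { groups =>
     // open one MongoDB connection here, build one document per (key, records)
     // entry in groups, insert it, then close the connection
     ()
   }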
Any advice?
You can do a self-join. First get the groups:
val groups = df.groupBy($"col1", $"col2", $"col3").agg($"col1", $"col2", $"col3")
Then you can join this back to the original DataFrame:
val joinedDF = groups
  .select($"col1" as "l_col1", $"col2" as "l_col2", $"col3" as "l_col3")
  .join(df, $"col1" <=> $"l_col1" and $"col2" <=> $"l_col2" and $"col3" <=> $"l_col3")
While this gets you exactly the same data you had originally (and with 3 additional, redundant columns) you could do another join to add a column with the MongoDB document ID for the (col1, col2, col3) group associated with the row.
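For example, one way to mint an ID per group and attach it to every row looks like the sketch below. This is only an illustration: doc_id is a made-up column name, monotonically_increasing_id is just one convenient way to generate the IDs, and the using-columns join assumes a reasonably recent Spark version.

import org.apache.spark.sql.functions.monotonically_increasing_id

val groupIds = df.select($"col1", $"col2", $"col3").distinct()
  .withColumn("doc_id", monotonically_increasing_id())          // one synthetic ID per group

val withDocId = df.join(groupIds, Seq("col1", "col2", "col3"))  // every row now carries its group's doc_id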
At any rate, in my experience joins and self-joins are the way you handle complicated stuff in DataFrames.
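For the write itself, the N-connection cap is the part I'd handle outside the DataFrame API: shrink the partition count, then open one connection per partition inside foreachPartition. A rough sketch with the 3.x MongoDB Java driver follows; the host, database/collection names, and the per-group payload (here just a row count) are placeholders you'd replace with your own.

import com.mongodb.MongoClient
import org.bson.Document

df.rdd
  .groupBy(row => (row.getAs[AnyRef]("col1"), row.getAs[AnyRef]("col2"), row.getAs[AnyRef]("col3")))
  .coalesce(N)                                   // at most N partitions => at most N simultaneous connections
  .foreachPartition { groups =>
    val client = new MongoClient("localhost")    // one connection per partition
    val coll = client.getDatabase("mydb").getCollection("docs")
    groups.foreach { case ((c1, c2, c3), rows) =>
      val doc = new Document("col1", c1).append("col2", c2).append("col3", c3)
        .append("rowCount", rows.size)           // stand-in for the real per-group document body
      coll.insertOne(doc)
    }
    client.close()
  }

The driver jar has to be on the executors' classpath, and whatever you put in each document needs to be a type the Mongo codecs can encode.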