I have a Spark 2.0 dataframe `example` with the following structure:
```
id, hour, count
id1, 0, 12
id1, 1, 55
..
id1, 23, 44
id2, 0, 12
id2, 1, 89
..
id2, 23, 34
etc.
```
It contains 24 entries for each id (one for each hour of the day) and is ordered by id, hour using the orderBy function.
I have created an Aggregator `groupConcat`:
```scala
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.expressions.Aggregator

def groupConcat(separator: String, columnToConcat: Int) =
  new Aggregator[Row, String, String] with Serializable {
    override def zero: String = ""
    override def reduce(b: String, a: Row) = b + separator + a.get(columnToConcat)
    override def merge(b1: String, b2: String) = b1 + b2
    override def finish(b: String) = b.substring(1)
    override def bufferEncoder: Encoder[String] = Encoders.STRING
    override def outputEncoder: Encoder[String] = Encoders.STRING
  }.toColumn
```
It helps me concatenate a column's values into a string to obtain this final dataframe:
```
id, hourly_count
id1, 12:55:..:44
id2, 12:89:..:34
etc.
```
My question is: if I do

```scala
example.orderBy($"id", $"hour").groupBy("id").agg(groupConcat(":", 2) as "hourly_count")
```

does that guarantee that the hourly counts will be ordered correctly in their respective buckets?
I read that this is not necessarily the case for RDDs (see Spark sort by key and then group by to get ordered iterable?), but maybe it's different for DataFrames?
If not, how can I work around it?
`groupBy` after `orderBy` doesn't maintain order, as others have pointed out. What you want to do is use a Window function, partitioned on id and ordered by hour. You can `collect_list` over this window and then take the max (largest) of the resulting lists, since they grow cumulatively (i.e. the first hour has only itself in its list, the second hour has two elements, and so on).
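To see that cumulative behaviour, here is a minimal sketch that just inspects the intermediate windowed column; it assumes the `data` DataFrame and imports defined in the complete example below:

```scala
// Sketch: inspect the cumulative lists that collect_list builds over the window.
// Assumes the `data` DataFrame and imports from the complete example below.
val byHour = Window.partitionBy("id").orderBy("hour")

data
  .withColumn("collected", collect_list($"count").over(byHour))
  .orderBy("id", "hour")
  .show(false)
// For id1 this shows [12], then [12, 55], then [12, 55, 44]:
// the row for the last hour carries the complete, hour-ordered list,
// which is why taking max of the lists per id recovers it.
```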
Complete example code:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val data = Seq(
  ("id1", 0, 12),
  ("id1", 1, 55),
  ("id1", 23, 44),
  ("id2", 0, 12),
  ("id2", 1, 89),
  ("id2", 23, 34)
).toDF("id", "hour", "count")

// Joins the collected counts with ":" separators.
val mergeList = udf { (strings: Seq[String]) => strings.mkString(":") }

data
  // Build cumulative lists of counts per id, in hour order.
  .withColumn("collected", collect_list($"count")
    .over(Window.partitionBy("id").orderBy("hour")))
  // The longest (max) list per id is the complete, ordered one.
  .groupBy("id")
  .agg(max($"collected").as("collected"))
  .withColumn("hourly_count", mergeList($"collected"))
  .select("id", "hourly_count")
  .show
```
This keeps us within the DataFrame world. I also simplified the UDF code the OP was using.
Output:
```
+---+------------+
| id|hourly_count|
+---+------------+
|id1|    12:55:44|
|id2|    12:89:34|
+---+------------+
```
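A design note: if you would rather avoid the window-plus-max trick, one common alternative (a sketch of my own, not the code above) is to collect (hour, count) pairs with a plain `groupBy` and sort the collected array afterwards; `sort_array` orders an array of structs by its first field, so the counts come out in hour order:

```scala
import org.apache.spark.sql.Row

// Sketch of an alternative: collect (hour, count) pairs in one pass,
// sort the array by hour afterwards, then join the counts.
// Assumes the same `data` DataFrame and imports as above.
val mergeSorted = udf { (pairs: Seq[Row]) => pairs.map(_.getInt(1)).mkString(":") }

data
  .groupBy("id")
  .agg(sort_array(collect_list(struct($"hour", $"count"))).as("pairs"))
  .withColumn("hourly_count", mergeSorted($"pairs"))
  .select("id", "hourly_count")
  .show
```

This skips the per-row cumulative lists, at the cost of sorting each collected array after the fact.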