SPARK DataFrame: How to efficiently split dataframe for each group based on same column values

I have a DataFrame generated as follows:

df.groupBy($"Hour", $"Category")
  .agg(sum($"value").alias("TotalValue"))
  .sort($"Hour".asc, $"TotalValue".desc)

The results look like:

+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   0|   cat26|      30.9|
|   0|   cat13|      22.1|
|   0|   cat95|      19.6|
|   0|  cat105|       1.3|
|   1|   cat67|      28.5|
|   1|    cat4|      26.8|
|   1|   cat13|      12.6|
|   1|   cat23|       5.3|
|   2|   cat56|      39.6|
|   2|   cat40|      29.7|
|   2|  cat187|      27.9|
|   2|   cat68|       9.8|
|   3|    cat8|      35.6|
| ...|    ....|      ....|
+----+--------+----------+

I would like to create a separate DataFrame for every unique value of col("Hour"), i.e.

for the group of Hour==0
for the group of Hour==1
for the group of Hour==2 and so on...

So the desired output would be:

df0 as:

+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   0|   cat26|      30.9|
|   0|   cat13|      22.1|
|   0|   cat95|      19.6|
|   0|  cat105|       1.3|
+----+--------+----------+

df1 as:

+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   1|   cat67|      28.5|
|   1|    cat4|      26.8|
|   1|   cat13|      12.6|
|   1|   cat23|       5.3|
+----+--------+----------+

and similarly,

df2 as:

+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   2|   cat56|      39.6|
|   2|   cat40|      29.7|
|   2|  cat187|      27.9|
|   2|   cat68|       9.8|
+----+--------+----------+

Any help is highly appreciated.

EDIT 1:

What I have tried:

df.foreach(row => splitHour(row))

def splitHour(row: Row) = {
  val Hour = row.getAs[Long]("Hour")

  val HourDF = sparkSession.createDataFrame(List((s"$Hour", 1)))

  val hdf = HourDF.withColumnRenamed("_1", "Hour_unique").drop("_2")

  val mydf: DataFrame = df.join(hdf, df("Hour") === hdf("Hour_unique"))

  mydf.write.mode("overwrite").parquet(s"/home/dev/shaishave/etc/myparquet/$Hour/")
}

PROBLEM WITH THIS STRATEGY:

It took 8 hours to run on a DataFrame df with over 1 million rows, with the Spark job given around 10 GB of RAM on a single node. So the join is turning out to be highly inefficient.

Caveat: I have to write each DataFrame mydf as parquet, and its nested schema must be preserved (not flattened).

asked Jan 15 '17 17:01

shubham rajput

2 Answers

As noted in my comments, one potentially easy approach to this problem would be to use:

df.write.partitionBy("hour").saveAsTable("myparquet")

As noted, the folder structure would be myparquet/hour=1, myparquet/hour=2, ..., myparquet/hour=24 as opposed to myparquet/1, myparquet/2, ..., myparquet/24.
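To make the folder layout concrete, here is a minimal sketch that can be pasted into spark-shell. It assumes a local session, a handful of made-up rows shaped like the question's result, and a temporary output directory; it also uses the path-based parquet(...) writer rather than saveAsTable so that the output location is explicit. None of the names below come from the original post.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration; inside spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("partition-layout").getOrCreate()
import spark.implicits._

// Toy rows shaped like the question's aggregated result (values are illustrative).
val sample = Seq(
  (0, "cat26", 30.9), (0, "cat13", 22.1),
  (1, "cat67", 28.5), (1, "cat4", 26.8),
  (2, "cat56", 39.6)
).toDF("Hour", "Category", "TotalValue")

// Hypothetical output location: a fresh temporary directory.
val outDir = Files.createTempDirectory("myparquet").toString

// One pass over the data; Spark routes each row to the directory for its Hour value.
sample.write.mode("overwrite").partitionBy("Hour").parquet(outDir)

// Collect the names of the partition subdirectories (marker files such as _SUCCESS are skipped).
val partitionDirs = new File(outDir).listFiles().filter(_.isDirectory).map(_.getName).sorted
partitionDirs.foreach(println)
```

With these rows the listing should show one subdirectory per distinct hour, named in the column=value form described above. The point of the approach is that there is no per-row join: the whole split happens in a single write.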

To change the folder structure, you could:

Potentially use the Hive configuration setting hcat.dynamic.partitioning.custom.pattern within an explicit HiveContext; more information is available at HCatalog DynamicPartitions.
Alternatively, change the file system directly after running the df.write.partitionBy.saveAsTable(...) command, with something like for f in *; do mv $f ${f/${f:0:5}/} ; done, which removes the Hour= text from the folder name.
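The second option can also be done from Scala instead of the shell. The following is only a sketch under the assumption that the output lives on the local file system (as in the question's /home/dev/... path); on HDFS the Hadoop FileSystem API would be needed instead. The helper names stripPartitionPrefix and flattenPartitionDirs are made up for illustration.

```scala
import java.io.File
import java.nio.file.Files

// Strip a leading "<key>=" from a directory name; any other name passes through unchanged.
def stripPartitionPrefix(name: String, key: String): String =
  if (name.startsWith(key + "=")) name.drop(key.length + 1) else name

// Rename every "<key>=<value>" subdirectory of `base` to "<value>" and return the new names.
def flattenPartitionDirs(base: File, key: String): Seq[String] =
  Option(base.listFiles()).getOrElse(Array.empty[File]).toSeq
    .filter(d => d.isDirectory && d.getName.startsWith(key + "="))
    .map { d =>
      val target = new File(base, stripPartitionPrefix(d.getName, key))
      Files.move(d.toPath, target.toPath) // a rename within the same file system
      target.getName
    }
    .sorted

// Demo on a throwaway directory with a hypothetical layout.
val base = Files.createTempDirectory("myparquet").toFile
Seq("Hour=0", "Hour=1", "Hour=2").foreach(n => new File(base, n).mkdir())
val renamed = flattenPartitionDirs(base, "Hour")
println(renamed.mkString(", ")) // prints "0, 1, 2"
```

Matching on the prefix rather than dropping a fixed five characters means files such as _SUCCESS are left alone.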

It is important to note that once the folder naming pattern is changed, running spark.read.parquet(...) on that folder will no longer discover the partitions automatically, since the partition key (i.e. Hour) information is missing.
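A small sketch of what that means in practice, assuming a local session, made-up sample rows, and a temporary directory (all hypothetical). Reading the base path recovers Hour from the directory names; reading one partition directory on its own, which is effectively what you get after renaming it, yields only the columns stored in the data files, so Hour has to be added back by hand.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Local session for illustration; inside spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("partition-read").getOrCreate()
import spark.implicits._

// Illustrative rows and a hypothetical temporary output directory.
val sample = Seq((0, "cat26", 30.9), (1, "cat67", 28.5)).toDF("Hour", "Category", "TotalValue")
val outDir = Files.createTempDirectory("myparquet").toString
sample.write.mode("overwrite").partitionBy("Hour").parquet(outDir)

// Reading the base path: Spark infers Hour from the Hour=<value> directory names.
val discovered = spark.read.parquet(outDir)
val hasHour = discovered.columns.contains("Hour")

// Reading a single partition directory directly: the data files hold only the
// non-partition columns, so Hour is absent here.
val single = spark.read.parquet(s"$outDir/Hour=0")
val singleCols = single.columns.toSeq

// When the value is known from the path, the column can be restored manually.
val restored = single.withColumn("Hour", lit(0))
```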

answered Oct 22 '22 17:10

Denny Lee

Another possible solution:

df.write.mode("overwrite").partitionBy("hour").parquet("address/to/parquet/location")

This is similar to the first answer, except that it writes parquet to an explicit path and uses mode("overwrite").
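If separate DataFrames such as df0, df1, df2 are still wanted after the write, one possible way is to read the partitioned output back and filter on Hour. This is a sketch only, assuming a local session, made-up rows, and a temporary path in place of address/to/parquet/location; the names all, hours, and byHour are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for illustration; inside spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("per-hour-frames").getOrCreate()
import spark.implicits._

// Illustrative rows and a hypothetical temporary output directory.
val sample = Seq(
  (0, "cat26", 30.9), (0, "cat13", 22.1),
  (1, "cat67", 28.5), (1, "cat4", 26.8),
  (2, "cat56", 39.6)
).toDF("Hour", "Category", "TotalValue")
val outDir = Files.createTempDirectory("myparquet").toString
sample.write.mode("overwrite").partitionBy("Hour").parquet(outDir)

// Read everything back; Hour is recovered from the directory names.
val all = spark.read.parquet(outDir)

// There are only a few distinct hours, so collecting them to the driver is cheap.
val hours: Seq[Int] = all.select($"Hour").distinct().as[Int].collect().sorted.toSeq

// Each entry is a lazy view; a filter on the partition column lets Spark
// skip the directories of the other hours when the view is evaluated.
val byHour: Map[Int, DataFrame] = hours.map(h => h -> all.where($"Hour" === h)).toMap

byHour(0).show()
```

Here byHour(0) plays the role of df0, byHour(1) of df1, and so on, without any join.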

answered Oct 22 '22 16:10

A.M.

Tags:

scala

apache-spark

apache-spark-sql

parquet

spark-dataframe
