
How to limit functions.collect_set in Spark SQL?

I'm dealing with a column of numbers in a large Spark DataFrame, and I would like to create a new column that stores an aggregated list of the unique numbers that appear in that column.

That's basically exactly what functions.collect_set does. However, I only need up to 1000 elements in the aggregated list. Is there any way to pass that limit to functions.collect_set(), or any other way to get at most 1000 elements in the aggregated list, without using a UDAF?

Since the column is so large, I'd like to avoid collecting all elements and trimming the list afterwards.

Thanks!

asked Aug 02 '16 by user1500142


1 Answer

Spark 2.4

As pointed out in a comment, Spark 2.4.0 comes with the slice standard function, which can do exactly this sort of thing.

val usage = sql("describe function slice").as[String].collect()(2)
scala> println(usage)
Usage: slice(x, start, length) - Subsets array x starting from index start (array indices start at 1, or starting from the end if start is negative) with the specified length.
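
The input dataset isn't shown in the original answer; as a sketch (name and values assumed, mirroring the sample dataset built later in this answer), it could be created like this:

// Hypothetical setup: 50 ids spread across 5 keys, matching the output shown below
val input = spark.range(50).withColumn("key", $"id" % 5)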

That gives the following query:

val q = input
  .groupBy('key)
  .agg(collect_set('id) as "collect")
  .withColumn("three_only", slice('collect, 1, 3))
scala> q.show(truncate = false)
+---+--------------------------------------+------------+
|key|collect                               |three_only  |
+---+--------------------------------------+------------+
|0  |[0, 15, 30, 45, 5, 20, 35, 10, 25, 40]|[0, 15, 30] |
|1  |[1, 16, 31, 46, 6, 21, 36, 11, 26, 41]|[1, 16, 31] |
|3  |[33, 48, 13, 38, 3, 18, 28, 43, 8, 23]|[33, 48, 13]|
|2  |[12, 27, 37, 2, 17, 32, 42, 7, 22, 47]|[12, 27, 37]|
|4  |[9, 19, 34, 49, 24, 39, 4, 14, 29, 44]|[9, 19, 34] |
+---+--------------------------------------+------------+
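
As a side note not in the original answer: slice is also available as a plain SQL expression, so the trimming can be folded into the aggregation call itself (a sketch, assuming the same input dataset as above):

val q2 = input
  .groupBy('key)
  // slice applied directly to the SQL form of collect_set
  .agg(expr("slice(collect_set(id), 1, 3)") as "three_only")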

Before Spark 2.4

I'd use a UDF that does what you want after collect_set (or collect_list), or a much harder-to-write UDAF.

Given more experience with UDFs than UDAFs, I'd go with the UDF first. Even though UDFs are not optimized by Catalyst, for this use case that's fine.

// trims an already-aggregated collection down to at most `limit` elements
val limitUDF = udf { (nums: Seq[Long], limit: Int) => nums.take(limit) }
// 50 ids spread across 5 keys (id % 5)
val sample = spark.range(50).withColumn("key", $"id" % 5)

scala> sample.groupBy("key").agg(collect_set("id") as "all").show(false)
+---+--------------------------------------+
|key|all                                   |
+---+--------------------------------------+
|0  |[0, 15, 30, 45, 5, 20, 35, 10, 25, 40]|
|1  |[1, 16, 31, 46, 6, 21, 36, 11, 26, 41]|
|3  |[33, 48, 13, 38, 3, 18, 28, 43, 8, 23]|
|2  |[12, 27, 37, 2, 17, 32, 42, 7, 22, 47]|
|4  |[9, 19, 34, 49, 24, 39, 4, 14, 29, 44]|
+---+--------------------------------------+

scala> sample.
  groupBy("key").
  agg(collect_set("id") as "all").
  withColumn("limit(3)", limitUDF($"all", lit(3))).
  show(false)
+---+--------------------------------------+------------+
|key|all                                   |limit(3)    |
+---+--------------------------------------+------------+
|0  |[0, 15, 30, 45, 5, 20, 35, 10, 25, 40]|[0, 15, 30] |
|1  |[1, 16, 31, 46, 6, 21, 36, 11, 26, 41]|[1, 16, 31] |
|3  |[33, 48, 13, 38, 3, 18, 28, 43, 8, 23]|[33, 48, 13]|
|2  |[12, 27, 37, 2, 17, 32, 42, 7, 22, 47]|[12, 27, 37]|
|4  |[9, 19, 34, 49, 24, 39, 4, 14, 29, 44]|[9, 19, 34] |
+---+--------------------------------------+------------+

See the functions object for the udf function's docs.
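
If you'd rather express the pre-2.4 version in SQL, one option (a sketch, not from the original answer; limit_set is an assumed name) is to register the same function and nest it around collect_set:

// Register the lambda under an assumed name so it can be used in SQL text
spark.udf.register("limit_set", (nums: Seq[Long], limit: Int) => nums.take(limit))
sample.createOrReplaceTempView("sample")
spark.sql("SELECT key, limit_set(collect_set(id), 3) AS limited FROM sample GROUP BY key").show(false)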

answered Sep 17 '22 by Jacek Laskowski