You can bucket the ages with the formula range_start = age - (age % interval), where interval here is 5.
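As a quick sanity check of the formula in plain Scala (no Spark needed; the values match the buckets in the output further down):

val interval = 5
Seq(1, 8, 47, 54).foreach { age =>
  println(s"age $age -> bucket starts at ${age - (age % interval)}")
}
// age 1  -> bucket starts at 0
// age 8  -> bucket starts at 5
// age 47 -> bucket starts at 45
// age 54 -> bucket starts at 50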
Create a test DataFrame for the demo:
import spark.implicits._ // needed for .toDF and the $ column syntax

val df = (100 to 400 by 7).map(id => (s"user$id", id % 60))
  .toDF("name", "age")
df.show(false)
+-------+---+
|name |age|
+-------+---+
|user100|40 |
|user107|47 |
|user114|54 |
|user121|1 |
|user128|8 |
|user135|15 |
|user142|22 |
|user149|29 |
|user156|36 |
|user163|43 |
|user170|50 |
|user177|57 |
|user184|4 |
|user191|11 |
|user198|18 |
|user205|25 |
|user212|32 |
|user219|39 |
|user226|46 |
|user233|53 |
+-------+---+
only showing top 20 rows
Now bucket the ages by the interval:
import org.apache.spark.sql.functions._
val interval = 5
df.withColumn("range", $"age" - ($"age" % interval))
  .withColumn("range", concat($"range", lit(" to "), $"range" + interval)) // optional: format the bucket as a readable "start to end" label
  .groupBy($"range")
  .agg(collect_list($"name").as("names")) // swap in whatever aggregation you need
  .show(false)
+--------+------------------------------------+
|range |names |
+--------+------------------------------------+
|10 to 15|[user191, user254, user310, user373]|
|50 to 55|[user114, user170, user233, user352]|
|5 to 10 |[user128, user247, user366] |
|55 to 60|[user177, user296, user359] |
|45 to 50|[user107, user226, user289, user345]|
|35 to 40|[user156, user219, user275, user338]|
|25 to 30|[user149, user205, user268, user387]|
|30 to 35|[user212, user331, user394] |
|0 to 5 |[user121, user184, user240, user303]|
|20 to 25|[user142, user261, user324, user380]|
|15 to 20|[user135, user198, user317] |
|40 to 45|[user100, user163, user282] |
+--------+------------------------------------+
You could also implement the same formula in a UDF, but that is likely to be slightly slower, since a UDF is opaque to Spark's Catalyst optimizer.
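For reference, here is a minimal sketch of that UDF variant (bucketLabel is a hypothetical name; the built-in column arithmetic above remains the better choice):

import org.apache.spark.sql.functions.udf

// Same formula as before, just evaluated row by row inside a UDF.
val bucketLabel = udf((age: Int) => {
  val start = age - (age % interval)
  s"$start to ${start + interval}"
})

df.withColumn("range", bucketLabel($"age"))
  .groupBy($"range")
  .agg(collect_list($"name").as("names"))
  .show(false)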