I am working on a dataframe df
, for instance the following dataframe:
df.show()
Output:
+----+------+
|keys|values|
+----+------+
| aa| apple|
| bb|orange|
| bb| desk|
| bb|orange|
| bb| desk|
| aa| pen|
| bb|pencil|
| aa| chair|
+----+------+
I use collect_set
to aggregate and get a set of objects with duplicate elements eliminated (or collect_list
to get list of objects).
df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))
The resulting dataframe is then as follows:
df_new.show()
Output:
+----+----------------------+
|keys|collectedSet_values |
+----+----------------------+
|bb |[orange, pencil, desk]|
|aa |[apple, pen, chair] |
+----+----------------------+
I am struggling to find a way to see if a specific keyword (like 'chair') is in the resulting set of objects (in column collectedSet_values
). I do not want to go with udf
solution.
Please comment your solutions/ideas.
Kind Regards.
monotonically_increasing_id ()[source] A column that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
describe (*cols)[source] Computes basic statistics for numeric and string columns. New in version 1.3. 1. This include count, mean, stddev, min, and max.
The Spark function collect_list() is used to aggregate the values into an ArrayType typically after group by and window partition.
DataFrame. selectExpr (*expr)[source] Projects a set of SQL expressions and returns a new DataFrame . This is a variant of select() that accepts SQL expressions.
Actually there is a nice function array_contains
which does that for us. The way we use it for set of objects is the same as in here. To know if word 'chair' exists in each set of object, we can simply do the following:
df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show()
Output:
+----+----------------------+--------------+
|keys|collectedSet_values |contains_chair|
+----+----------------------+--------------+
|bb |[orange, pencil, desk]|false |
|aa |[apple, pen, chair] |true |
+----+----------------------+--------------+
The same applies to the result of collect_list
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With