PySpark: check if an element is in collect_list [duplicate]

I am working on a dataframe df, for instance the following dataframe:

df.show()

Output:

+----+------+
|keys|values|
+----+------+
|  aa| apple|
|  bb|orange|
|  bb|  desk|
|  bb|orange|
|  bb|  desk|
|  aa|   pen|
|  bb|pencil|
|  aa| chair|
+----+------+

I use collect_set to aggregate the values per key into a set with duplicate elements eliminated (or collect_list to get a list that keeps duplicates).

from pyspark.sql.functions import collect_set
df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))

The resulting dataframe is then as follows:

df_new.show()

Output:

+----+----------------------+
|keys|collectedSet_values   |
+----+----------------------+
|bb  |[orange, pencil, desk]|
|aa  |[apple, pen, chair]   |
+----+----------------------+

I am struggling to find a way to check whether a specific keyword (like 'chair') is in the resulting set (in column collectedSet_values). I would rather not use a UDF.

Please comment your solutions/ideas.

Kind Regards.

Ala Tarighati asked Jul 24 '18 13:07


1 Answer

There is a built-in function, array_contains, that does this. It works on the collected set just as it does on any other array column. To check whether the word 'chair' exists in each set, we can simply do the following:

from pyspark.sql.functions import array_contains
df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show()

Output:

+----+----------------------+--------------+
|keys|collectedSet_values   |contains_chair|
+----+----------------------+--------------+
|bb  |[orange, pencil, desk]|false         |
|aa  |[apple, pen, chair]   |true          |
+----+----------------------+--------------+

The same applies to the result of collect_list.

Ala Tarighati answered Oct 22 '22 05:10