pyspark; check if an element is in collect_list [duplicate]

Tags: apache-spark, apache-spark-sql, pyspark

I am working on a dataframe df, for instance the following dataframe:

df.show()

Output:

+----+------+
|keys|values|
+----+------+
|  aa| apple|
|  bb|orange|
|  bb|  desk|
|  bb|orange|
|  bb|  desk|
|  aa|   pen|
|  bb|pencil|
|  aa| chair|
+----+------+

I use collect_set to aggregate the values for each key into a set with duplicate elements eliminated (or collect_list to get a list that keeps the duplicates).

from pyspark.sql.functions import collect_set
df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))

The resulting dataframe is then as follows:

df_new.show(truncate=False)

Output:

+----+----------------------+
|keys|collectedSet_values   |
+----+----------------------+
|bb  |[orange, pencil, desk]|
|aa  |[apple, pen, chair]   |
+----+----------------------+

I am struggling to find a way to check whether a specific keyword (like 'chair') is in the resulting set of objects (in column collectedSet_values). I would prefer not to use a UDF.

Please comment your solutions/ideas.

Kind Regards.

513

asked Jul 24 '18 13:07

Ala Tarighati

1 Answer

Actually, there is a handy function, array_contains, that does this for us. It works on a collected set the same way it works on any other array column. To find out whether the word 'chair' exists in each set of objects, we can simply do the following:

from pyspark.sql.functions import array_contains
df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show(truncate=False)

Output:

+----+----------------------+--------------+
|keys|collectedSet_values   |contains_chair|
+----+----------------------+--------------+
|bb  |[orange, pencil, desk]|false         |
|aa  |[apple, pen, chair]   |true          |
+----+----------------------+--------------+

The same applies to the result of collect_list.

123

answered Oct 22 '22 05:10

Ala Tarighati
