
Pyspark filter out empty lists using .filter()

I have a pyspark dataframe in which one column contains lists, some with entries and some empty. I want to efficiently filter out all rows whose list is empty.

import pyspark.sql.functions as sf
# comparing the array column to a Python list literal:
df.filter(sf.col('column_with_lists') != [])

raises the following error:

Py4JJavaError: An error occurred while calling o303.notEqual.
: java.lang.RuntimeException: Unsupported literal type class

Perhaps I could check the length of the list and require it to be > 0 (see here). However, I am unsure how that syntax works with pyspark-sql, and whether filter even accepts a lambda.

To be clear: the dataframe has multiple columns, but I want to apply the above filter to a single one, dropping the rows where that column's list is empty. The linked SO example filters on a single column. A minimal sketch of the kind of dataframe I mean is below.
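For reference, here is a toy reproduction (the data and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# an id column plus an array column, where row 2 holds an empty list
df = spark.createDataFrame(
    [(1, ['a', 'b']), (2, []), (3, ['c'])],
    ['id', 'column_with_lists'],
)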

Thanks in advance!

asked Feb 24 '17 by gaatjeniksaan

1 Answer

So it appears to be as simple as using the size function from pyspark.sql.functions:

import pyspark.sql.functions as sf
# keep only the rows whose array column has at least one element
df.filter(sf.size('column_with_lists') > 0)
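A quick self-contained sketch to demonstrate, using toy data like the question's (names are made up):

from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()

# toy data: row 2 carries an empty list
df = spark.createDataFrame(
    [(1, ['a', 'b']), (2, []), (3, ['c'])],
    ['id', 'column_with_lists'],
)

# only rows 1 and 3 survive; row 2's empty list is filtered out
df.filter(sf.size('column_with_lists') > 0).show()

Note that this also drops rows where the column is NULL: depending on the Spark version and configuration, size returns -1 or NULL for NULL input, and neither satisfies > 0.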
answered Jan 04 '23 by gaatjeniksaan