I am using pyspark 2.3.1 and would like to filter array elements with an expression and not an using udf:
>>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"])
>>> df.show()
+----+----+---------------+
|col1|col2| col3|
+----+----+---------------+
| 1| A| [1, 2, 3, 4]|
| 2| B|[1, 2, 3, 4, 5]|
+----+----+---------------+
The expreesion shown below is wrong, I wonder how to tell spark to remove out any values from the array in col3 which are smaller than 3. I want something like:
>>> filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)")).show()
>>> filtered.show()
+----+----+---------+
|col1|col2| newcol|
+----+----+---------+
| 1| A| [3, 4]|
| 2| B|[3, 4, 5]|
+----+----+---------+
I have already an udf solution, but it is very slow (> 1 billions data rows):
largerThan = F.udf(lambda row,max: [x for x in row if x >= max], ArrayType(IntegerType()))
df = df.withColumn('newcol', size(largerThan(df.queries, lit(3))))
Any help is welcome. Thank you very much in advance.
Spark < 2.4
There is no *reasonable replacement for udf
in PySpark.
Spark >= 2.4
Your code:
expr("filter(col3, x -> x >= 3)")
can be used as is.
Reference
Querying Spark SQL DataFrame with complex types
* Given the cost of exploding or converting to and from RDD udf
is almost exclusively preferable.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With