
PySpark - pass a value from another column as the parameter of spark function

I have a Spark DataFrame that looks like this, where expr is a SQL/Hive filter expression.

+-------------------------+---------+-----+
|expr                     |var1     |var2 |
+-------------------------+---------+-----+
|var1 > 7                 |9        |0    |
|var1 > 7                 |9        |0    |
|var1 > 7                 |9        |0    |
|var1 > 7                 |9        |0    |
|var1 = 3 AND var2 >= 0   |9        |0    |
|var1 = 3 AND var2 >= 0   |9        |0    |
|var1 = 3 AND var2 >= 0   |9        |0    |
|var1 = 3 AND var2 >= 0   |9        |0    |
|var1 = 2 AND var2 >= 0   |9        |0    |
+-------------------------+---------+-----+

I want to transform it into the DataFrame below, where flag is the boolean value obtained by evaluating the expression in the 'expr' column:

+-------------------------+---------+-----+---------+
|expr                     |var1     |var2 |flag     |
+-------------------------+---------+-----+---------+
|var1 > 7                 |9        |0    |  True   |
|var1 > 7                 |9        |0    |  True   |
|var1 > 7                 |9        |0    |  True   |
|var1 > 7                 |9        |0    |  True   |
|var1 = 3 AND var2 >= 0   |9        |0    |     .   |
|var1 = 3 AND var2 >= 0   |9        |0    |     .   |
|var1 = 3 AND var2 >= 0   |9        |0    |     .   |
|var1 = 3 AND var2 >= 0   |9        |0    |     .   |
|var1 = 2 AND var2 >= 0   |9        |0    |     .   |
+-------------------------+---------+-----+---------+

I have tried using the expr function like this:

df.withColumn('flag', expr(col('expr')))

It fails, as expected, because the expr function expects a string as its parameter.

Another idea I considered was writing a UDF and passing the 'expr' column's value to it, but that would not let me use PySpark's expr function, since a UDF runs plain Python code outside Spark SQL.

What should my approach be? Any suggestions please?

asked Jun 19 '20 by UtkarshSahu



1 Answer

So here's a PySpark solution without a UDF. In Scala I believe you could use map or foldLeft with the same logic.

from pyspark.sql.functions import col, expr, lit, when

# Collect the distinct filter expressions as a list of strings
exprs = [row['expr'] for row in df.select('expr').distinct().collect()]

# Evaluate each expression on its matching rows; otherwise() keeps earlier results
df = df.withColumn('test', lit(None).cast('boolean'))
for ex in exprs:
    df = df.withColumn('test',
                       when(col('expr') == lit(ex), expr(ex)).otherwise(col('test')))

df.show()
+--------------------+----+----+-----+
|                expr|var1|var2| test|
+--------------------+----+----+-----+
|            var1 > 7|   9|   0| true|
|            var1 > 7|   9|   0| true|
|            var1 > 7|   9|   0| true|
|            var1 > 7|   9|   0| true|
|var1 = 3 AND var2...|   9|   0|false|
|var1 = 3 AND var2...|   9|   0|false|
|var1 = 3 AND var2...|   9|   0|false|
|var1 = 3 AND var2...|   9|   0|false|
|var1 = 2 AND var2...|   9|   0|false|
+--------------------+----+----+-----+
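Incidentally, the foldLeft idea mentioned above maps naturally onto functools.reduce in Python. This is only a sketch equivalent to the loop above, reusing the exprs list and the same df:

from functools import reduce
from pyspark.sql.functions import col, expr, lit, when

# Fold the expressions into chained withColumn calls, same result as the loop
df = reduce(
    lambda acc, ex: acc.withColumn(
        'test', when(col('expr') == lit(ex), expr(ex)).otherwise(col('test'))),
    exprs,
    df.withColumn('test', lit(None).cast('boolean')),
)
df.show()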

I should point out that I don't understand why the OP wants to do this; with better context about the problem, I'm sure there's a better way.

Looping in driver code isn't usually the most efficient approach, but here the loop runs over the handful of distinct expressions rather than over the data, so Spark folds all of the withColumn calls into a single plan. The single collect() added only about 2 seconds to the execution time on a DataFrame of 20+ million rows.
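If you want to check that yourself, a quick sketch (assuming the df from the loop above is in scope) is to print the query plan; the chained withColumn projections should collapse into a single Project in the physical plan rather than one pass per expression:

# Sketch: inspect the plan built by the loop above. Catalyst collapses the
# chained withColumn projections, so the physical plan should contain a
# single Project over the source rather than one stage per expression.
df.explain(True)  # True prints the parsed/analyzed/optimized/physical plans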


UPDATE:

I understand the problem a bit better now. This version will be faster, because Spark evaluates all of the filter expressions at once and then coalesces them into one column.

from pyspark.sql.functions import coalesce, col, expr, lit, when

# Tip: perform the collect on the smaller DF that contains the filter expressions
exprs = [row['expr'] for row in df.select('expr').distinct().collect()]

# One when() per distinct expression; coalesce keeps the branch that matched
df = df.withColumn('filter',
                   coalesce(*[when(col('expr') == lit(ex), expr(ex)) for ex in exprs])
                   )
df.show()
+--------------------+----+----+------+
|                expr|var1|var2|filter|
+--------------------+----+----+------+
|            var1 > 7|   9|   0|  true|
|            var1 > 7|   9|   0|  true|
|            var1 > 7|   9|   0|  true|
|            var1 > 7|   9|   0|  true|
|var1 = 3 AND var2...|   9|   0| false|
|var1 = 3 AND var2...|   9|   0| false|
|var1 = 3 AND var2...|   9|   0| false|
|var1 = 3 AND var2...|   9|   0| false|
|var1 = 2 AND var2...|   9|   0| false|
+--------------------+----+----+------+
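For reference, here is a minimal, self-contained run of the coalesce approach on the sample data from the question, with the result column named flag as the OP asked. It is a sketch that assumes an existing SparkSession bound to spark:

from pyspark.sql.functions import coalesce, col, expr, lit, when

# Recreate the question's sample data (assumes a SparkSession named `spark`)
df = spark.createDataFrame(
    [('var1 > 7', 9, 0)] * 4
    + [('var1 = 3 AND var2 >= 0', 9, 0)] * 4
    + [('var1 = 2 AND var2 >= 0', 9, 0)],
    ['expr', 'var1', 'var2'],
)

# One when(...) per distinct expression; coalesce keeps the branch whose
# string matches the row's own 'expr' value
exprs = [row['expr'] for row in df.select('expr').distinct().collect()]
df = df.withColumn(
    'flag',
    coalesce(*[when(col('expr') == lit(ex), expr(ex)) for ex in exprs]),
)
df.show(truncate=False)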
answered Sep 28 '22 by Topde