apply function to all values in array column pyspark

I want to make all values in an array column in my pyspark data frame negative without exploding (!). I tried this udf but it didn't work:

negative = func.udf(lambda x: x * -1, T.ArrayType(T.FloatType()))
cast_contracts = cast_contracts \
    .withColumn('forecast_values', negative('forecast_values'))

Can someone help?

Example data frame:

from pyspark.sql import Row

df = sc.parallelize(
   [Row(name='Joe', forecast_values=[1.0,2.0,3.0]),
    Row(name='Mary', forecast_values=[4.0,7.1])]).toDF()
>>> df.show()
    +----+---------------+
    |name|forecast_values|
    +----+---------------+
    | Joe|[1.0, 2.0, 3.0]|
    |Mary|     [4.0, 7.1]|
    +----+---------------+

Thanks

LN_P asked Dec 31 '22 13:12


2 Answers

I know this is a year-old post, so the solution I'm about to give may not have been an option previously. If you're using Spark 3.0 or above with the PySpark API, consider using the Spark SQL higher-order function transform inside pyspark.sql.functions.expr (the SQL function itself dates back to Spark 2.4). Don't confuse it with DataFrame.transform(), which is used for chaining DataFrame transformations. At any rate, here is the solution:

import pyspark.sql.functions as F

df.withColumn("negative", F.expr("transform(forecast_values, x -> x * -1)"))

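The only thing you need to make sure of is that the array elements are numeric (int or float). This approach is much more efficient than exploding the array or using UDFs, because the expression is evaluated inside Spark itself and the rows never have to be serialized to a Python worker.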

mrammah answered Jan 03 '23 03:01


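The UDF receives each array as a plain Python list, so x * -1 acts on the list as a whole instead of negating its elements. You need to loop over the list values and multiply each one by -1: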

import pyspark.sql.functions as F
import pyspark.sql.types as T

negative = F.udf(lambda x: [i * -1 for i in x], T.ArrayType(T.FloatType()))
cast_contracts = df \
    .withColumn('forecast_values', negative('forecast_values'))
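A quick plain-Python check, independent of Spark, shows the difference between multiplying the list itself and multiplying its elements. The variable names here are illustrative only.

```python
values = [1.0, 2.0, 3.0]

# Multiplying a list by an integer repeats the list; a count of zero
# or less yields an empty list, so nothing is negated.
repeated = values * -1

# Multiplying each element is what the corrected UDF does.
negated = [v * -1 for v in values]

print(repeated)  # []
print(negated)   # [-1.0, -2.0, -3.0]
```

This is why the original UDF ran without raising an error but did not produce the negated values.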

On Spark versions without higher-order functions you cannot avoid the UDF, but this is probably the best way to write it. It works better if you have large lists:

import numpy as np

negative = F.udf(lambda x: np.negative(x).tolist(), T.ArrayType(T.FloatType()))
cast_contracts = df \
    .withColumn('forecast_values', negative('forecast_values'))
cast_contracts.show()
+------------------+----+
|   forecast_values|name|
+------------------+----+
|[-1.0, -2.0, -3.0]| Joe|
|      [-4.0, -7.1]|Mary|
+------------------+----+

pissall answered Jan 03 '23 05:01