I have a loop that generates several tables of factors and stores each table's column names in a list:
| id | f_1a | f_2a |
|:---|:-----|:-----|
| 1  | 1.2  | 0.95 |
| 2  | 0.7  | 0.87 |
| 3  | 1.2  | 1.4  |
col_lst = ['f_1a','f_2a']
| id | f_1b | f_2b | f_3b |
|:---|:-----|:-----|:-----|
| 1  | 1.6  | 1.2  | 0.98 |
| 2  | 0.9  | 0.65 | 1.7  |
| 3  | 1.1  | 1.33 | 1.4  |
col_lst = ['f_1b','f_2b','f_3b']
I'm having difficulty writing PySpark code that creates a new column containing the product of the listed columns for each table, such that:
| id | f_1a | f_2a | f_a  |
|:---|:-----|:-----|:-----|
| 1  | 1.2  | 0.95 | 1.14 |
| 2  | 0.7  | 0.87 | 0.61 |
| 3  | 1.2  | 1.4  | 1.68 |
| id | f_1b | f_2b | f_3b | f_b  |
|:---|:-----|:-----|:-----|:-----|
| 1  | 1.6  | 1.2  | 0.98 | 1.88 |
| 2  | 0.9  | 0.65 | 1.7  | 1    |
| 3  | 1.1  | 1.33 | 1.4  | 2.05 |
Any help would be greatly appreciated.
Use reduce to apply an anonymous function that multiplies the column values row-wise.
from functools import reduce
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 1.6, 1.2, 0.98),
                            (2, 0.9, 0.65, 1.7),
                            (3, 1.1, 1.33, 1.4)],
                           ('id', 'f_1b', 'f_2b', 'f_3b'))
df.show()
Solution:
df.withColumn('f_b', reduce(lambda a, b: F.round(a * b, 2), [F.col(c) for c in df.drop('id').columns])).show()
Outcome:
+---+----+----+----+----+
| id|f_1b|f_2b|f_3b| f_b|
+---+----+----+----+----+
|  1| 1.6| 1.2|0.98|1.88|
|  2| 0.9|0.65| 1.7| 1.0|
|  3| 1.1|1.33| 1.4|2.04|
+---+----+----+----+----+
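Note that rounding inside the reduce happens at every multiplication, so the result can drift from the fully precise product (2.04 above versus 2.05 for the last row). A minimal variant, assuming the same df as above, that multiplies first and rounds only once at the end:
from functools import reduce
from pyspark.sql import functions as F

# Multiply all factor columns first, then round the final product once
cols = [F.col(c) for c in df.drop('id').columns]
df.withColumn('f_b', F.round(reduce(lambda a, b: a * b, cols), 2)).show()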
Here is another way using an expression:
First create your col_lst:
col_lst = ['f_1b','f_2b','f_3b']
Or
col_lst = [col for col in df.columns if col != 'id']
Then:
from pyspark.sql import functions as F
df.withColumn("fb",F.round(F.expr("*".join(col_lst)),2)).show()
+---+----+----+----+----+
| id|f_1b|f_2b|f_3b| f_b|
+---+----+----+----+----+
|  1| 1.6| 1.2|0.98|1.88|
|  2| 0.9|0.65| 1.7|0.99|
|  3| 1.1|1.33| 1.4|2.05|
+---+----+----+----+----+
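Since the original question loops over several tables, either approach can be applied per table. A minimal sketch, assuming the loop collects each DataFrame together with its column list and a result-column name (df_a, df_b, and the tuples below are hypothetical placeholders for whatever your loop produces):
from pyspark.sql import functions as F

# Hypothetical pairing of each table with its factor columns and result column name
tables = [(df_a, ['f_1a', 'f_2a'], 'f_a'),
          (df_b, ['f_1b', 'f_2b', 'f_3b'], 'f_b')]

results = []
for df, col_lst, new_col in tables:
    # Build a SQL product expression from the listed columns and round once
    results.append(df.withColumn(new_col, F.round(F.expr('*'.join(col_lst)), 2)))

for res in results:
    res.show()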