I have the following pySpark dataframe:
+------------------+------------------+--------------------+--------------+-------+
| col1| col2| col3| X| Y|
+------------------+------------------+--------------------+--------------+-------+
|2.1729247374294496| 3.558069532647046| 6.607603368496324| 1| null|
|0.2654841575294071|1.2633077949463256|0.023578679968183733| 0| null|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554| 3| null|
| 2.608497168338446| 3.529397129549324| 0.373034222141551| 2| null|
+------------------+------------------+--------------------+--------------+-------+
It is a rather simple operation and I could easily do it with pandas. However, I need to do it using only pySpark.
I want to do the following (I'll write it in a sort of pseudocode):
In row where col3 == max(col3), change Y from null to 'K'
In the remaining rows, in the row where col1 == max(col1), change Y from null to 'Z'
In the remaining rows, in the row where col1 == min(col1), change Y from null to 'U'
In the remaining row: change Y from null to 'I'.
Therefore, the expected output is:
+------------------+------------------+--------------------+--------------+-------+
| col1| col2| col3| X| Y|
+------------------+------------------+--------------------+--------------+-------+
|2.1729247374294496| 3.558069532647046| 6.607603368496324| 1| K|
|0.2654841575294071|1.2633077949463256|0.023578679968183733| 0| U|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554| 3| I|
| 2.608497168338446| 3.529397129549324| 0.373034222141551| 2| Z|
+------------------+------------------+--------------------+--------------+-------+
Having that done, I need to use this table as lookup for another table:
+--------------------+--------+-----+------------------+--------------+------------+
| x1| x2| x3| x4| X| d|
+--------------------+--------+-----+------------------+--------------+------------+
|0057f68a-6330-42a...| 2876| 30| 5.989999771118164| 0| 20171219|
|05cc0191-4ee4-412...| 108381| 34|24.979999542236328| 3| 20171219|
|06f353af-e9d3-4d0...| 118798| 34| 0.0| 3| 20171219|
|0c69b607-112b-4f3...| 20993| 34| 0.0| 0| 20171219|
|0d1b52ba-1502-4ff...|   23817|   34|               0.0|             0|    20171219|
+--------------------+--------+-----+------------------+--------------+------------+
I want to use the first table as a lookup to create a new column in the second table. The values for the new column should be looked up in column Y of the first table, using column X of the second table as the key (i.e., for each row of the second table, take the Y value of the first-table row whose X matches).
UPD: I need a solution robust to one row satisfying two conditions, for example:
+------------------+------------------+--------------------+--------------+-------+
| col1| col2| col3| X| Y|
+------------------+------------------+--------------------+--------------+-------+
| 2.608497168338446| 3.558069532647046| 6.607603368496324| 1| null|
|0.2654841575294071|1.2633077949463256|0.023578679968183733| 0| null|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554| 3| null|
|2.1729247374294496| 3.529397129549324| 0.373034222141551| 2| null|
+------------------+------------------+--------------------+--------------+-------+
In this case row 0 satisfies both max('col3') and max('col1') conditions.
So what needs to happen is this:
Row 0 becomes 'K'
Row 3 becomes 'Z' (because, among the remaining rows (row 0 already has 'K'), row 3 satisfies the max('col1') condition)
Row 1 becomes 'U'
Row 2 becomes 'I'
I cannot have multiple rows in table 1 with 'I' in them.
Compute aggregates:
from pyspark.sql import functions as F
df = spark.createDataFrame([
(2.1729247374294496, 3.558069532647046, 6.607603368496324, 1),
(0.2654841575294071, 1.2633077949463256, 0.023578679968183733, 0),
(0.4253301781296708, 3.4566490739823483, 0.11711202266039554, 3),
(2.608497168338446, 3.529397129549324, 0.373034222141551, 2)
], ("col1", "col2", "col3", "x"))
min1, max1, max3 = df.select(F.min("col1"), F.max("col1"), F.max("col3")).first()
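first() returns a single Row, which unpacks into plain Python values; for the sample data above:
print(min1, max1, max3)
# 0.2654841575294071 2.608497168338446 6.607603368496324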
Add a column with when:
y = (F.when(F.col("col3") == max3, "K") # In row where col3 == max(col3), change Y from null to 'K'
.when(F.col("col1") == max1, "Z") # In the remaining rows, in the row where col1 == max(col1), change Y from null to 'Z'
.when(F.col("col1") == min1, "U") # In the remaining rows, in the row where col1 == min(col1), change Y from null to 'U'
.otherwise("I")) # In the remaining row: change Y from null to 'I'
df_with_y = df.withColumn("y", y)
df_with_y.show()
# +------------------+------------------+--------------------+---+---+
# | col1| col2| col3| x| y|
# +------------------+------------------+--------------------+---+---+
# |2.1729247374294496| 3.558069532647046| 6.607603368496324| 1| K|
# |0.2654841575294071|1.2633077949463256|0.023578679968183733| 0| U|
# |0.4253301781296708|3.4566490739823483| 0.11711202266039554| 3| I|
# | 2.608497168338446| 3.529397129549324| 0.373034222141551| 2| Z|
# +------------------+------------------+--------------------+---+---+
To look up the values for the new column in column Y of the first table, using column X of the second table as the key, join on the key column:
df_with_y.select("x", "y").join(df2, ["x"])
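A slightly fuller sketch of that lookup join, assuming the second table is available as df2 with the columns shown above (the ids and the alias label here are made-up placeholders, not real values):
# minimal stand-in for the second table; the fifth column corresponds to X in the question
df2 = spark.createDataFrame([
    ("id-1", 2876, 30, 5.989999771118164, 0, "20171219"),
    ("id-2", 108381, 34, 24.979999542236328, 3, "20171219"),
    ("id-3", 118798, 34, 0.0, 3, "20171219")
], ("x1", "x2", "x3", "x4", "x", "d"))

# keep only the key and the looked-up value; rename y to avoid a name clash
lookup = df_with_y.select(F.col("x"), F.col("y").alias("label"))
df2.join(lookup, on="x", how="left").show()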
If y already exists and you want to preserve its non-null values:
df_ = spark.createDataFrame([
(2.1729247374294496, 3.558069532647046, 6.607603368496324, 1, "G"),
(0.2654841575294071, 1.2633077949463256, 0.023578679968183733, 0, None),
(0.4253301781296708, 3.4566490739823483, 0.11711202266039554, 3, None),
(2.608497168338446, 3.529397129549324, 0.373034222141551, 2, None)
], ("col1", "col2", "col3", "x", "y"))
# compute the aggregates only over rows where y is still null
min1_, max1_, max3_ = df_.filter(F.col("y").isNull()).select(F.min("col1"), F.max("col1"), F.max("col3")).first()
y_ = (F.when(F.col("col3") == max3_, "K")
.when(F.col("col1") == max1_, "Z")
.when(F.col("col1") == min1_, "U")
.otherwise("I"))
df_.withColumn("y", F.coalesce(F.col("y"), y_)).show()
# +------------------+------------------+--------------------+---+---+
# | col1| col2| col3| x| y|
# +------------------+------------------+--------------------+---+---+
# |2.1729247374294496| 3.558069532647046| 6.607603368496324| 1| G|
# |0.2654841575294071|1.2633077949463256|0.023578679968183733| 0| U|
# |0.4253301781296708|3.4566490739823483| 0.11711202266039554| 3| I|
# | 2.608497168338446| 3.529397129549324| 0.373034222141551| 2| K|
# +------------------+------------------+--------------------+---+---+
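To satisfy the UPD requirement (a single row satisfying two conditions takes only the higher-priority label, and each remaining label is still assigned exactly once), one option is to assign the labels one at a time, recomputing each aggregate only over the rows whose y is still null. The sketch below reuses the same building blocks as above; it is one possible approach, not the only one, and it runs one small Spark job per label:
# start with an all-null label column
df_y = df.withColumn("y", F.lit(None).cast("string"))

# (label, column, aggregate) in priority order: K by max(col3), Z by max(col1), U by min(col1)
for label, col_name, agg in [("K", "col3", F.max), ("Z", "col1", F.max), ("U", "col1", F.min)]:
    # aggregate only over rows that have not received a label yet
    target = df_y.filter(F.col("y").isNull()).select(agg(col_name)).first()[0]
    df_y = df_y.withColumn(
        "y",
        F.when(F.col("y").isNull() & (F.col(col_name) == target), label).otherwise(F.col("y")))

# whatever is still unlabeled becomes 'I'
df_y = df_y.withColumn("y", F.coalesce(F.col("y"), F.lit("I")))
df_y.show()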
If you experience numerical precision issues you can try:
threshold = 0.0000001  # choose a tolerance appropriate for your data
y_t = (F.when(F.abs(F.col("col3") - max3) < threshold, "K") # In row where col3 == max(col3), change Y from null to 'K'
.when(F.abs(F.col("col1") - max1) < threshold, "Z") # In the remaining rows, in the row where col1 == max(col1), change Y from null to 'Z'
.when(F.abs(F.col("col1") - min1) < threshold, "U") # In the remaining rows, in the row where col1 == min(col1), change Y from null to 'U'
.otherwise("I")) # In the remaining row: change Y from null to 'I'
df.withColumn("y", y_t).show()
# +------------------+------------------+--------------------+---+---+
# | col1| col2| col3| x| y|
# +------------------+------------------+--------------------+---+---+
# |2.1729247374294496| 3.558069532647046| 6.607603368496324| 1| K|
# |0.2654841575294071|1.2633077949463256|0.023578679968183733| 0| U|
# |0.4253301781296708|3.4566490739823483| 0.11711202266039554| 3| I|
# | 2.608497168338446| 3.529397129549324| 0.373034222141551| 2| Z|
# +------------------+------------------+--------------------+---+---+