I have a dataframe with a score for each offer for each contact. I want to create a new dataframe from it that holds the top 3 offers for each contact.
The input dataframe is something like this:
=======================================================================
| contact | offer 1 | offer 2 | offer 3 | offer 4 | offer 5 | offer 6 |
=======================================================================
| name 1 | 0 | 3 | 1 | 2 | 1 | 6 |
-----------------------------------------------------------------------
| name 2 | 1 | 7 | 2 | 9 | 5 | 3 |
-----------------------------------------------------------------------
I want to convert it to a dataframe like this:
===============================================================
| contact | best offer | second best offer | third best offer |
===============================================================
| name 1 | offer 6 | offer 2 | offer 4 |
---------------------------------------------------------------
| name 2 | offer 4 | offer 2 | offer 5 |
---------------------------------------------------------------
You'll need a few imports:
from pyspark.sql.functions import array, col, lit, sort_array, struct
With data as shown in the question:
df = sc.parallelize([
    ("name 1", 0, 3, 1, 2, 1, 6),
    ("name 2", 1, 7, 2, 9, 5, 3),
]).toDF(["contact"] + ["offer_{}".format(i) for i in range(1, 7)])
you can assemble and sort an array of structs:
offers = sort_array(array(*[
    struct(col(c).alias("v"), lit(c).alias("k")) for c in df.columns[1:]
]), asc=False)
and select:
df.select(
    ["contact"] + [offers[i]["k"].alias("_{}".format(i)) for i in [0, 1, 2]])
which should give the following result:
+-------+-------+-------+-------+
|contact| _0| _1| _2|
+-------+-------+-------+-------+
| name 1|offer_6|offer_2|offer_4|
| name 2|offer_4|offer_2|offer_5|
+-------+-------+-------+-------+
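The sort works because structs are compared field by field, so putting the score ("v") before the offer name ("k") orders the array by score first. As a rough illustration only, here is the same idea in plain Python with tuples instead of structs. The helper name `top_offers` and the dict layout are made up for this sketch and are not part of the Spark API:

```python
def top_offers(row, n=3):
    """Return the names of the n highest-scoring offers in a {name: score} dict."""
    # Score goes first so tuples compare by score, then by name on ties,
    # mirroring struct(v, k) in the Spark version.
    pairs = [(score, name) for name, score in row.items()]
    # reverse=True plays the role of asc=False in sort_array.
    return [name for _, name in sorted(pairs, reverse=True)[:n]]


name_1 = {"offer_1": 0, "offer_2": 3, "offer_3": 1,
          "offer_4": 2, "offer_5": 1, "offer_6": 6}
name_2 = {"offer_1": 1, "offer_2": 7, "offer_3": 2,
          "offer_4": 9, "offer_5": 5, "offer_6": 3}

print(top_offers(name_1))
print(top_offers(name_2))
```

Note that with a descending sort, offers with equal scores end up ordered by name in descending order as well, which may matter if ties can reach the top 3.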
Rename the columns according to your needs and you're ready to go.
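One possible way to do the renaming is to pair each array position with the name you want. The sketch below assumes the `offers` column defined earlier; the helper `ranked_columns` and the `NAMES` list are hypothetical, with the names taken from the table in the question:

```python
# Hypothetical output column names, copied from the question's target table.
NAMES = ["best offer", "second best offer", "third best offer"]


def ranked_columns(offers, names=NAMES):
    """Alias the "k" field of the i-th array element with names[i]."""
    # `offers` is expected to be the sorted array-of-structs column.
    return [offers[i]["k"].alias(name) for i, name in enumerate(names)]
```

With that in place, something like `df.select("contact", *ranked_columns(offers))` should produce the column headers from the question.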