I have a sample DF, trying to replace the list of column values with ascending sorted index:
DF:
df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
a b c d1 d2 date
0 2 7 9 Apple Orange 2002-01-01
1 6 0 9 Mango lemon 2002-01-01
2 8 0 0 Apple lemon 2002-01-01
3 4 4 4 Mango Orange 2002-01-01
4 5 0 8 Mango lemon 2002-02-01
5 6 1 6 Mango Orange 2002-02-01
6 7 2 7 Apple lemon 2002-02-01
Step 1:
Group the DF by "date" column, sample group on "2002-01-01"
a b c d1 d2 date
0 2 7 9 Apple Orange 2002-01-01
1 6 0 9 Mango lemon 2002-01-01
2 8 0 0 Apple lemon 2002-01-01
3 4 4 4 Mango Orange 2002-01-01
Step 2:
In this group, replace the values of columns ["d1","d2"]
with index (not the DF index) of sorted mean values based on c
.
For example in the above group mean(c, d1="Apple") = [9+0]/2 => 4.5
and
mean(c, d1="Mango") = [9+4]/2 => 6.5
so the ascending sorted index
is Apple:0
and Mango:1
so the value of column d1
will be replaced like the following:
a b c d1 d2 date
0 2 7 9 0 Orange 2002-01-01
1 6 0 9 1 lemon 2002-01-01
2 8 0 0 0 lemon 2002-01-01
3 4 4 4 1 Orange 2002-01-01
Apply this for the entire df
. I have a brute force approach of iterating through the groups and each rows, any suggestions for a more pandas
based solution will be helpful in improving the efficiency.
Is this what you are looking for in column d1? You could apply some similar technique to d2 as well. Its not the most elegant solution though.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df['mean_value'] = df.groupby(['date', 'd1'])['c'].transform(lambda x: np.mean(x))
df['rank_value'] = (df.groupby(['date'])['mean_value'].rank(ascending=True, method='dense') - 1).astype(int)
df['d1'] = df['rank_value']
df.drop(labels=['rank_value', 'mean_value'], axis=1, inplace=True)
df
a b c d1 d2 date
0 3 1 4 1 Orange 2002-01-01
1 9 7 5 0 lemon 2002-01-01
2 9 9 5 1 lemon 2002-01-01
3 8 1 2 0 Orange 2002-01-01
4 8 0 1 0 lemon 2002-02-01
5 1 8 3 0 Orange 2002-02-01
6 8 0 4 1 lemon 2002-02-01
You may use pivot_table
and groupby.rank
to create rank. After that use map
to assign values back
df1 = df.pivot_table('c', ['date','d1']).groupby(level=0).rank(method='dense')-1
df['d1'] = df[['date','d1']].agg(tuple, axis=1).map(df1.c).astype('int')
Out[255]:
a b c d1 d2 date
0 2 7 9 0 Orange 2002-01-01
1 6 0 9 1 lemon 2002-01-01
2 8 0 0 0 lemon 2002-01-01
3 4 4 4 1 Orange 2002-01-01
4 5 0 8 0 lemon 2002-02-01
5 6 1 6 0 Orange 2002-02-01
6 7 2 7 0 lemon 2002-02-01
Note: group 2002-02-01
has the same mean value 7
for both Mango
and Apple
so the rank is all 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With