Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace pandas column with sorted index

I have a sample DF, trying to replace the list of column values with ascending sorted index:

DF:

df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])

    a   b   c    d1      d2       date
0   2   7   9   Apple   Orange  2002-01-01
1   6   0   9   Mango   lemon   2002-01-01
2   8   0   0   Apple   lemon   2002-01-01
3   4   4   4   Mango   Orange  2002-01-01
4   5   0   8   Mango   lemon   2002-02-01
5   6   1   6   Mango   Orange  2002-02-01
6   7   2   7   Apple   lemon   2002-02-01

Step 1:

Group the DF by "date" column, sample group on "2002-01-01"


        a   b   c    d1      d2       date
    0   2   7   9   Apple   Orange  2002-01-01
    1   6   0   9   Mango   lemon   2002-01-01
    2   8   0   0   Apple   lemon   2002-01-01
    3   4   4   4   Mango   Orange  2002-01-01

Step 2:

In this group, replace the values of columns ["d1","d2"] with index (not the DF index) of sorted mean values based on c.

For example in the above group mean(c, d1="Apple") = [9+0]/2 => 4.5 and mean(c, d1="Mango") = [9+4]/2 => 6.5 so the ascending sorted index is Apple:0 and Mango:1

so the value of column d1 will be replaced like the following:

            a   b   c   d1       d2       date
        0   2   7   9   0      Orange   2002-01-01
        1   6   0   9   1      lemon    2002-01-01
        2   8   0   0   0      lemon    2002-01-01
        3   4   4   4   1      Orange   2002-01-01

Apply this for the entire df. I have a brute force approach of iterating through the groups and each rows, any suggestions for a more pandas based solution will be helpful in improving the efficiency.

like image 873
data_person Avatar asked Jun 19 '20 22:06

data_person


2 Answers

Is this what you are looking for in column d1? You could apply some similar technique to d2 as well. Its not the most elegant solution though.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])

df['mean_value'] = df.groupby(['date', 'd1'])['c'].transform(lambda x: np.mean(x))
df['rank_value'] = (df.groupby(['date'])['mean_value'].rank(ascending=True, method='dense') - 1).astype(int)
df['d1'] = df['rank_value']
df.drop(labels=['rank_value', 'mean_value'], axis=1, inplace=True)

df

   a  b  c  d1      d2       date
0  3  1  4   1  Orange 2002-01-01
1  9  7  5   0   lemon 2002-01-01
2  9  9  5   1   lemon 2002-01-01
3  8  1  2   0  Orange 2002-01-01
4  8  0  1   0   lemon 2002-02-01
5  1  8  3   0  Orange 2002-02-01
6  8  0  4   1   lemon 2002-02-01
like image 97
eNc Avatar answered Oct 10 '22 10:10

eNc


You may use pivot_table and groupby.rank to create rank. After that use map to assign values back

df1 = df.pivot_table('c', ['date','d1']).groupby(level=0).rank(method='dense')-1
df['d1'] = df[['date','d1']].agg(tuple, axis=1).map(df1.c).astype('int')

Out[255]:
   a  b  c  d1      d2        date
0  2  7  9   0  Orange  2002-01-01
1  6  0  9   1   lemon  2002-01-01
2  8  0  0   0   lemon  2002-01-01
3  4  4  4   1  Orange  2002-01-01
4  5  0  8   0   lemon  2002-02-01
5  6  1  6   0  Orange  2002-02-01
6  7  2  7   0   lemon  2002-02-01

Note: group 2002-02-01 has the same mean value 7 for both Mango and Apple so the rank is all 0

like image 40
Andy L. Avatar answered Oct 10 '22 11:10

Andy L.