
How to groupby and map by two columns in a pandas DataFrame

I have a problem in Python working with a pandas DataFrame. I'm trying to build a machine learning model that predicts the surface. I have the surface column in the train DataFrame but not in the test DataFrame, so I would like to create some features based on the surface in train, like:

train['error_cat1'] = abs(train.groupby(train['cat1'])['surface'].transform('mean')  - train.surface.mean())

Here, for each row, I take the mean surface of its "cat1" group and store its absolute difference from the overall mean surface. Cool.

Now I must add it to the test too, so I use this method to map the value of each group in train onto the matching test rows:

mp = {k: g['error_cat1'].tolist()[0] for k,g in train.groupby('cat1')}
test['error_cat1'] = test['cat1'].map(mp)
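
For reference, the single-column mapping above can also be sketched with a Series lookup instead of a dict built from the first row of each group. This is only an illustration under assumptions: the toy values and the extra category `4` in test are hypothetical, and `error_by_cat1` is a name introduced here.

```python
import pandas as pd

# Hypothetical toy data; column names follow the snippets above.
train = pd.DataFrame({"cat1": [1, 2, 3, 1, 2, 3],
                      "surface": [10, 12, 12, 5, 10, 13]})
test = pd.DataFrame({"cat1": [1, 2, 3, 4]})

global_mean = train["surface"].mean()

# One value per category: |group mean - global mean|, indexed by cat1.
error_by_cat1 = (train.groupby("cat1")["surface"].mean() - global_mean).abs()

# Series.map looks each cat1 value up in the index of error_by_cat1.
train["error_cat1"] = train["cat1"].map(error_by_cat1)
test["error_cat1"] = test["cat1"].map(error_by_cat1)  # categories unseen in train become NaN
```

The lookup table is computed once from train and reused for both frames, so train and test are guaranteed to get the same value for the same category.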

So far there is no problem. Now I would like to use two columns in the groupby:

train['error_cat1_cat2'] = abs(train.groupby(['cat1', 'cat2'])['surface'].transform('mean') - train.surface.mean())

But I don't know how to map it onto the test DataFrame. Can you please help me with this problem, or suggest another way to do it?


For example, my train is

| Cat1 | Cat2 | surface |
| 1    | 3    | 10      |
| 2    | 2    | 12      |
| 3    | 1    | 12      |
| 1    | 3    | 5       |
| 2    | 2    | 10      |
| 3    | 2    | 13      |

My test is

| Cat1 | Cat2 |
| 1    | 2    |
| 2    | 1    |
| 3    | 1    |
| 1    | 3    |
| 2    | 3    |
| 3    | 1    |

Now I would compute the mean surface grouped by cat1 and cat2. For example, the mean surface for (cat1, cat2) = (1, 3) is (10 + 5) / 2 = 7.5.

Then I must go to the test and map this value onto the rows where (cat1, cat2) = (1, 3).

I hope this makes the problem clear.

asked Oct 18 '22 01:10 by John Karimov

1 Answer

You can use:

  • groupby().mean() to calculate the means
  • reset_index() to convert the index levels Cat1, Cat2 into columns again
  • merge(how='left') to join the two dataframes like tables in a database (LEFT JOIN in SQL).


import pandas as pd

headers = ['Cat1', 'Cat2', 'surface']

train_data = [
    [1, 3, 10],
    [2, 2, 12],
    [3, 1, 12],
    [1, 3, 5],
    [2, 2, 10],
    [3, 2, 13],
]

test_data = [
    [1, 2],
    [2, 1],
    [3, 1],
    [1, 3],
    [2, 3],
    [3, 1],
]

train = pd.DataFrame(train_data, columns=headers)
test = pd.DataFrame(test_data, columns=headers[:-1])

print('--- train ---')
print(train)

print('--- test ---')
print(test)

print('--- means ---')
# mean surface for every (Cat1, Cat2) pair seen in train
means = train.groupby(['Cat1', 'Cat2']).mean()
print(means)

print('--- means (dataframe) ---')
# turn the Cat1/Cat2 index levels back into ordinary columns
means = means.reset_index(level=['Cat1', 'Cat2'])
print(means)

print('--- result ----')
# LEFT JOIN: keep every test row; pairs missing from train get NaN
result = pd.merge(test, means, on=['Cat1', 'Cat2'], how='left')
print(result)

print('--- result (fillna)---')
# replace NaN (pairs not present in train) with 0
result = result.fillna(0)
print(result)


--- train ---
   Cat1  Cat2  surface
0     1     3       10
1     2     2       12
2     3     1       12
3     1     3        5
4     2     2       10
5     3     2       13
--- test ---
   Cat1  Cat2
0     1     2
1     2     1
2     3     1
3     1     3
4     2     3
5     3     1
--- means ---
           surface
Cat1 Cat2         
1    3         7.5
2    2        11.0
3    1        12.0
     2        13.0
--- means (dataframe) ---
   Cat1  Cat2  surface
0     1     3      7.5
1     2     2     11.0
2     3     1     12.0
3     3     2     13.0
--- result ----
   Cat1  Cat2  surface
0     1     2      NaN
1     2     1      NaN
2     3     1     12.0
3     1     3      7.5
4     2     3      NaN
5     3     1     12.0
--- result (fillna)---
   Cat1  Cat2  surface
0     1     2      0.0
1     2     1      0.0
2     3     1     12.0
3     1     3      7.5
4     2     3      0.0
5     3     1     12.0
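
The output above maps the plain group mean. The question's feature is the absolute difference between the group mean and the overall mean, so the same groupby/merge steps could be adapted roughly as below. This is a sketch under assumptions: the column name `error_cat1_cat2` comes from the question, while `keys`, `global_mean` and `errors` are names introduced here for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    "Cat1": [1, 2, 3, 1, 2, 3],
    "Cat2": [3, 2, 1, 3, 2, 2],
    "surface": [10, 12, 12, 5, 10, 13],
})
test = pd.DataFrame({"Cat1": [1, 2, 3, 1, 2, 3],
                     "Cat2": [2, 1, 1, 3, 3, 1]})

keys = ["Cat1", "Cat2"]
global_mean = train["surface"].mean()

# One row per (Cat1, Cat2) pair seen in train, holding |group mean - global mean|.
errors = (
    train.groupby(keys)["surface"].mean()
    .sub(global_mean)
    .abs()
    .rename("error_cat1_cat2")
    .reset_index()
)

# LEFT JOIN the lookup table onto both frames so they share the same values.
train = train.merge(errors, on=keys, how="left")
test = test.merge(errors, on=keys, how="left")
print(test)
```

Pairs that never occur in train, such as (1, 2) here, come out as NaN after the merge. Whether to fill them with 0, as in the answer, or with some other value is a modelling choice.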
answered Oct 21 '22 05:10 by furas
