
Split a dataframe into new dataframes, one per subset/group of the original dataframe

I have a pandas dataframe that looks like the following and holds groups of data via a column id:

import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['id'] = ['W', 'W', 'W', 'Z', 'Z', 'Y', 'Y', 'Y', 'Z', 'Z']

print(df)

          A         B         C         D id
0  0.347501 -1.152416  1.441144 -0.144545  w
1  0.775828 -1.176764  0.203049 -0.305332  w
2  1.036246 -0.467927  0.088138 -0.438207  w
3 -0.737092 -0.231706  0.268403  0.464026  x
4 -1.857346 -1.420284 -0.515517 -0.231774  x
5 -0.970731  0.217890  0.193814 -0.078838  y
6 -0.318314 -0.244348  0.162103  1.204386  y
7  0.340199  1.074977  1.201068 -0.431473  y
8  0.202050  0.790434  0.643458 -0.068620  z
9 -0.882865  0.687325 -0.008771 -0.066912  z

Now I want to create new dataframes (named df_w, df_x, df_y, df_z) that each hold only their own group's rows from the original dataframe, ideally collected in some iterable, e.g. a list:

df_w

          A         B         C         D id
0  0.347501 -1.152416  1.441144 -0.144545  w
1  0.775828 -1.176764  0.203049 -0.305332  w
2  1.036246 -0.467927  0.088138 -0.438207  w

df_x

          A         B         C         D id
0 -0.737092 -0.231706  0.268403  0.464026  x
1 -1.857346 -1.420284 -0.515517 -0.231774  x

df_y

          A         B         C         D id
0 -0.970731  0.217890  0.193814 -0.078838  y
1 -0.318314 -0.244348  0.162103  1.204386  y
2  0.340199  1.074977  1.201068 -0.431473  y

df_z

          A         B         C         D id
0  0.202050  0.790434  0.643458 -0.068620  z
1 -0.882865  0.687325 -0.008771 -0.066912  z

Is there any smart (vectorized pandas) way to achieve this using groupby, apply and/or applymap and a function?

I was thinking about iterating over the dataframe, but that doesn't seem very elegant.

Thanks in advance for any hints!

Cord Kaldemeyer asked Jan 03 '23 15:01


2 Answers

We can create a dict of DataFrames, keyed by the values of the id column:

In [166]: dfs = {k:v for k,v in df.groupby('id')}

In [168]: dfs.keys()
Out[168]: dict_keys(['W', 'Y', 'Z'])

In [169]: dfs['W']
Out[169]:
          A         B         C         D id
0 -0.373021 -0.555218  0.022980 -0.512323  W
1 -1.599466  0.637292  0.045059 -0.334030  W
2  0.100659  0.557068  0.142226 -0.186214  W

In [170]: dfs['Y']
Out[170]:
          A         B         C         D id
5  0.540107 -0.739077  0.992408  2.010203  Y
6 -0.201376 -0.913222 -0.173284  1.837442  Y
7 -1.367659  0.915360  0.072720 -0.886071  Y

In [171]: dfs['Z']
Out[171]:
          A         B         C         D id
3 -0.329087  0.842431  0.839319 -0.597823  Z
4 -0.594375 -0.950486  1.125584  0.116599  Z
8  0.366667 -0.978279 -1.449893  0.192451  Z
9 -0.007439 -0.084612  0.010192 -0.417602  Z

UPDATE: with the index of each group reset so that it starts from 0:

In [177]: {k:v.reset_index(drop=True) for k,v in df.groupby('id')}
Out[177]:
{'W':           A         B         C         D id
 0 -0.373021 -0.555218  0.022980 -0.512323  W
 1 -1.599466  0.637292  0.045059 -0.334030  W
 2  0.100659  0.557068  0.142226 -0.186214  W,
 'Y':           A         B         C         D id
 0  0.540107 -0.739077  0.992408  2.010203  Y
 1 -0.201376 -0.913222 -0.173284  1.837442  Y
 2 -1.367659  0.915360  0.072720 -0.886071  Y,
 'Z':           A         B         C         D id
 0 -0.329087  0.842431  0.839319 -0.597823  Z
 1 -0.594375 -0.950486  1.125584  0.116599  Z
 2  0.366667 -0.978279 -1.449893  0.192451  Z
 3 -0.007439 -0.084612  0.010192 -0.417602  Z}
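
Since the question asked for an iterable such as a list, here is a minimal sketch of how the dict above could be turned into one. The name df_list is only illustrative and not part of the original answer; the sample data is the frame from the question.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question (values are random).
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['id'] = ['W', 'W', 'W', 'Z', 'Z', 'Y', 'Y', 'Y', 'Z', 'Z']

# Iterating over a groupby yields (key, sub-frame) pairs;
# reset_index(drop=True) makes every sub-frame start at index 0.
dfs = {k: v.reset_index(drop=True) for k, v in df.groupby('id')}

# df_list is a hypothetical name: a plain list of the per-group frames.
df_list = list(dfs.values())

for key, sub in dfs.items():
    print(key, sub.shape)
```

The dict keeps the group key next to each frame, which is usually more convenient than a bare list when the groups have to be looked up by name later.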
MaxU - stop WAR against UA answered Jan 31 '23 14:01


I think the best approach is to create a dict by converting the groupby object to a tuple of (key, DataFrame) pairs and then to a dict:

# make the index of each group start from 0
df.index = df.groupby('id').cumcount()

dfs = dict(tuple(df.groupby('id')))
print (dfs)
{'W':           A         B         C         D id
0  1.331587  0.715279 -1.545400 -0.008384  W
1  0.621336 -0.720086  0.265512  0.108549  W
2  0.004291 -0.174600  0.433026  1.203037  W, 'Y':           A         B         C         D id
0 -1.977728 -1.743372  0.266070  2.384967  Y
1  1.123691  1.672622  0.099149  1.397996  Y
2 -0.271248  0.613204 -0.267317 -0.549309  Y, 'Z':           A         B         C         D id
0 -0.965066  1.028274  0.228630  0.445138  Z
1 -1.136602  0.135137  1.484537 -1.079805  Z
2  0.132708 -0.476142  1.308473  0.195013  Z
3  0.400210 -0.337632  1.256472 -0.731970  Z}

print (dfs['Y'])
          A         B         C         D id
0 -1.977728 -1.743372  0.266070  2.384967  Y
1  1.123691  1.672622  0.099149  1.397996  Y
2 -0.271248  0.613204 -0.267317 -0.549309  Y
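
A side note on key order: groupby sorts the group keys by default, which is why the dict above comes back as W, Y, Z even though Z appears before Y in the column. A small sketch, assuming the order of first appearance matters to you, using the sort=False parameter of groupby (the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question (values are random).
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['id'] = ['W', 'W', 'W', 'Z', 'Z', 'Y', 'Y', 'Y', 'Z', 'Z']

# Default behaviour: group keys are returned in sorted order.
sorted_keys = list(dict(tuple(df.groupby('id'))))

# sort=False keeps the keys in order of first appearance in the column.
dfs = dict(tuple(df.groupby('id', sort=False)))
appearance_keys = list(dfs)

print(sorted_keys)      # ['W', 'Y', 'Z']
print(appearance_keys)  # ['W', 'Z', 'Y']
```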

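For the sake of interest, it is also possible to create custom DataFrame names via globals(), but a dict is the better choice:

# use a separate loop variable (g) so the original df is not overwritten
for i, g in df.groupby('id'):
    globals()['df_' + i] = g.reset_index(drop=True)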

print (df_Y)
          A         B         C         D id
0 -1.977728 -1.743372  0.266070  2.384967  Y
1  1.123691  1.672622  0.099149  1.397996  Y
2 -0.271248  0.613204 -0.267317 -0.549309  Y
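
If only one or two groups are needed by name, a possible alternative to writing into globals() is to pull them out individually with get_group. This is only a sketch on the question's sample data; grouped and df_y are illustrative names, not something the answer above defines.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question (values are random).
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['id'] = ['W', 'W', 'W', 'Z', 'Z', 'Y', 'Y', 'Y', 'Z', 'Z']

grouped = df.groupby('id')

# get_group returns the rows belonging to one key; reset the index
# so the resulting frame starts from 0 like the other examples.
df_y = grouped.get_group('Y').reset_index(drop=True)

print(df_y)
```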
jezrael answered Jan 31 '23 14:01