
Split the index into separate columns in pandas

Tags:

python

pandas

I have a large dataframe from which I extract the data I need with groupby. I then need to turn the levels of the new dataframe's index into several separate columns.

Part of the original dataframe looks like this:

        code         place     vl   year    week
0   111.0002.0056   region1     1   2017    29
1   112.6500.2285   region2     1   2017    31
2   112.5600.6325   region2     1   2017    30
3   112.5600.6325   region2     1   2017    30
4   112.5600.8159   region2     1   2017    30
5   111.0002.0056   region2     1   2017    29
6   111.0002.0056   region2     1   2017    30
7   111.0002.0056   region2     1   2017    28
8   112.5600.8159   region3     1   2017    31
9   112.5600.8159   region3     1   2017    28
10  111.0002.0114   region3     1   2017    31
....

After applying groupby, it looks like this (code: df_test1 = df_test.groupby(['code' , 'year', 'week', 'place'])['vl'].sum().unstack(fill_value=0)):

                        place  region1  region2  region3  region4  index1
code            year    week                    
111.0002.0006   2017     29       0        3        0        0     (111.0002.0006, 2017, 29)
                         30       0        7        0        0     (111.0002.0006, 2017, 30)
111.0002.0018   2017     29       0        0        0        0     (111.0002.0018, 2017, 29)
111.0002.0029   2017     30       0        0        0        0     (111.0002.0029, 2017, 30)
111.0002.0055   2017     28       0        33       0        8     (111.0002.0055, 2017, 28)
                         29       1        155      2        41    (111.0002.0055, 2017, 29)
                         30       0        142      1        39    (111.0002.0055, 2017, 30)
                         31       0        31       0        13    (111.0002.0055, 2017, 31)
111.0002.0056   2017     28       9        36       0        4     (111.0002.0056, 2017, 28)
                         29       20       75       2        37    (111.0002.0056, 2017, 29)
                         30       17       81       2        33    (111.0002.0056, 2017, 30)
....

I save the index in a separate column index1 (code: df_test1['index1'] = df_test1.index). From the column index1 I need to get three separate columns: code, year and week.

The result should look like this:

region1 region2 region3 region4       code     year  week                   
   0       3       0       0    111.0002.0006  2017   29
   0       7       0       0    111.0002.0006  2017   30
   0       0       0       0    111.0002.0018  2017   29
   0       0       0       0    111.0002.0029  2017   30
   0       33      0       8    111.0002.0055  2017   28
   1       155     2       41   111.0002.0055  2017   29
   0       142     1       39   111.0002.0055  2017   30
   0       31      0       13   111.0002.0055  2017   31
....

I would be grateful for any advice!

yanadm asked Aug 29 '17 12:08


3 Answers

Use reset_index instead of df_test1['index1'] = df_test1.index. To get a clean dataframe, also add rename_axis, which removes the columns name place:

df_test1 = df_test.groupby(['code' , 'year', 'week',  'place'])['vl'].sum() \
                  .unstack(fill_value=0) \
                  .reset_index() \
                  .rename_axis(None, axis=1)
print (df_test1)

            code  year  week  region1  region2  region3
0  111.0002.0056  2017    28        0        1        0
1  111.0002.0056  2017    29        1        1        0
2  111.0002.0056  2017    30        0        1        0
3  111.0002.0114  2017    31        0        0        1
4  112.5600.6325  2017    30        0        2        0
5  112.5600.8159  2017    28        0        0        1
6  112.5600.8159  2017    30        0        1        0
7  112.5600.8159  2017    31        0        0        1
8  112.6500.2285  2017    31        0        1        0

Finally, if necessary, change the order of the columns:

# cols lists the key columns to move to the end; all of them must exist in df_test1
cols = ['code' , 'year', 'week']
df_test1 = df_test1[[x for x in df_test1.columns if x not in cols] + cols]
print (df_test1)
   region1  region2  region3           code  year  week
0        0        1        0  111.0002.0056  2017    28
1        1        1        0  111.0002.0056  2017    29
2        0        1        0  111.0002.0056  2017    30
3        0        0        1  111.0002.0114  2017    31
4        0        2        0  112.5600.6325  2017    30
5        0        0        1  112.5600.8159  2017    28
6        0        1        0  112.5600.8159  2017    30
7        0        0        1  112.5600.8159  2017    31
8        0        1        0  112.6500.2285  2017    31
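
For readers who want to try the steps above end to end, here is a minimal self-contained sketch. It rebuilds only the eleven sample rows shown in the question (the real dataframe is larger), and the names keys and wide are introduced here purely for illustration.

```python
import pandas as pd

# Sample rows copied from the question; the full dataset is not available.
df_test = pd.DataFrame({
    "code": ["111.0002.0056", "112.6500.2285", "112.5600.6325", "112.5600.6325",
             "112.5600.8159", "111.0002.0056", "111.0002.0056", "111.0002.0056",
             "112.5600.8159", "112.5600.8159", "111.0002.0114"],
    "place": ["region1", "region2", "region2", "region2", "region2", "region2",
              "region2", "region2", "region3", "region3", "region3"],
    "vl": [1] * 11,
    "year": [2017] * 11,
    "week": [29, 31, 30, 30, 30, 29, 30, 28, 31, 28, 31],
})

keys = ["code", "year", "week"]  # hypothetical helper name

wide = (
    df_test.groupby(keys + ["place"])["vl"].sum()
    .unstack(fill_value=0)       # one column per place, 0 where no rows exist
    .reset_index()               # index levels code/year/week become columns
    .rename_axis(None, axis=1)   # drop the leftover columns name "place"
)

# Move the key columns to the end, as in the desired output.
wide = wide[[c for c in wide.columns if c not in keys] + keys]
print(wide)
```

Because reset_index turns every index level into a column in one call, there is no need to store the index as tuples first and split them afterwards.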

jezrael answered Sep 20 '22 00:09


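Alternatively, you can try pd.crosstab. Note that by default it counts rows rather than summing vl, which gives the same numbers for the sample above only because every vl there is 1: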

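# Move the grouping keys into the index, then cross-tabulate against place
df = df.set_index(['code', 'year', 'week', 'vl'])
df = pd.crosstab(df.index, df.place).reset_index()
# Expand the tuples in row_0 into separate columns, then drop row_0 from df
df[['code', 'year', 'week', 'vl']] = df['row_0'].apply(pd.Series)
df = df.drop('row_0', axis=1)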

Out[32]: 
place  region1  region2  region3           code  year  week  vl
0            0        1        0  111.0002.0056  2017    28   1
1            1        1        0  111.0002.0056  2017    29   1
2            0        1        0  111.0002.0056  2017    30   1
3            0        0        1  111.0002.0114  2017    31   1
4            0        2        0  112.5600.6325  2017    30   1
5            0        0        1  112.5600.8159  2017    28   1
6            0        1        0  112.5600.8159  2017    30   1
7            0        0        1  112.5600.8159  2017    31   1
8            0        1        0  112.6500.2285  2017    31   1
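
If vl is not always 1, a crosstab that sums vl is probably closer to the groupby in the question. The sketch below is one way to do that with the values and aggfunc parameters of pd.crosstab. The small dataframe is hypothetical (made-up codes and values, not taken from the question), chosen so that summing and counting give different results.

```python
import pandas as pd

# Hypothetical data: vl differs from 1 so that sum and count are distinguishable.
df = pd.DataFrame({
    "code": ["A", "A", "A", "B"],
    "year": [2017, 2017, 2017, 2017],
    "week": [28, 28, 29, 28],
    "place": ["region1", "region1", "region2", "region2"],
    "vl": [2, 3, 1, 4],
})

out = (
    pd.crosstab(
        index=[df["code"], df["year"], df["week"]],
        columns=df["place"],
        values=df["vl"],
        aggfunc="sum",           # sum vl instead of counting rows
    )
    .fillna(0)                   # combinations with no rows come back as NaN
    .astype(int)
    .reset_index()               # code/year/week become ordinary columns
    .rename_axis(None, axis=1)   # drop the columns name "place"
)
print(out)
```

Passing the three key Series as a list keeps them as named index levels, so reset_index restores them as columns without any tuple splitting.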

BENY answered Sep 18 '22 00:09


You can skip creating index1 entirely and use the get_level_values method of df_test1.index, passing the position (or name) of the level you want. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.get_level_values.html#pandas.MultiIndex.get_level_values for details. The calls should look something like:

df_test1['code'] = df_test1.index.get_level_values(0)
df_test1['year'] = df_test1.index.get_level_values(1)
df_test1['week'] = df_test1.index.get_level_values(2)

This should work no matter how you generated the MultiIndex - whether by groupby(), pivot_table(), or otherwise.
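
As a small variation on the calls above, the levels can also be looked up by name, which avoids hard-coding positions. This sketch assumes the MultiIndex levels are named code, year and week, as in the question's groupby output. The three rows are copied from that output, and flat is a name introduced here for illustration.

```python
import pandas as pd

# First three rows of the question's grouped output (region1 and region2 only).
idx = pd.MultiIndex.from_tuples(
    [("111.0002.0006", 2017, 29),
     ("111.0002.0006", 2017, 30),
     ("111.0002.0018", 2017, 29)],
    names=["code", "year", "week"],
)
df_test1 = pd.DataFrame({"region1": [0, 0, 0], "region2": [3, 7, 0]}, index=idx)

# Copy every index level into a column of the same name.
for name in df_test1.index.names:
    df_test1[name] = df_test1.index.get_level_values(name)

# Optionally discard the MultiIndex once its values live in columns.
flat = df_test1.reset_index(drop=True)
print(flat)
```

The new columns are appended on the right, so the result already has the region columns first and code, year, week last.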

Sarah Messer answered Sep 18 '22 00:09