Pandas `DataFrameGroupBy` and `SeriesGroupBy`

I admit that I am not a Python guru, but I still find working with Pandas DataFrameGroupBy and SeriesGroupBy objects exceptionally counter-intuitive. (I have an R background.)

I have the dataframe below:

import pandas as pd
import numpy as np
df = pd.DataFrame({'id' : range(1,9),
                   'code' : ['one', 'one', 'two', 'three',
                             'two', 'three', 'one', 'two'],
                   'colour': ['black', 'white','white','white',
                           'black', 'black', 'white', 'white'],
                   'irrelevant1': ['foo', 'foo', 'foo','bar','bar',
                                     'foo','bar','bar'],
                   'irrelevant2': ['foo', 'foo', 'foo','bar','bar',
                                     'foo','bar','bar'],
                   'irrelevant3': ['foo', 'foo', 'foo','bar','bar',
                                     'foo','bar','bar'],
                   'amount' : np.random.randn(8)},
                  columns=['id', 'code', 'colour', 'irrelevant1',
                           'irrelevant2', 'irrelevant3', 'amount'])

I want to get the ids grouped by code and colour. The code below does the grouping but keeps all columns.

gb = df.groupby(['code','colour'])
gb.head(5)
                id   code colour irrelevant1 irrelevant2 irrelevant3    amount
code  colour                                                                  
one   black  0   1    one  black         foo         foo         foo -0.644170
      white  1   2    one  white         foo         foo         foo  0.912372
             6   7    one  white         bar         bar         bar  0.530575
three black  5   6  three  black         foo         foo         foo -0.123806
      white  3   4  three  white         bar         bar         bar -0.387080
two   black  4   5    two  black         bar         bar         bar -0.578107
      white  2   3    two  white         foo         foo         foo  0.768637
             7   8    two  white         bar         bar         bar -0.282577

Questions:

1) In gb, how do I store only the id column (without even an index) and get rid of the rest?

2) Once I have the desired DataFrameGroupBy gb, how do I access the ids of the cases where {code=one and colour=white}? I tried gb.get_group('one','white') and gb.get_group(['one','white']), but neither works.

3) How do I access the entries where {colour=white}, i.e. without specifying the code level?

4) Finally, the manual is not very helpful. Do you know of any sources with examples of how to create and access these grouped objects?

Zhubarb asked Mar 20 '23 23:03


1 Answer

For your problem, you don't even need to perform a groupby (though you should read more about it in the prose docs).

A better solution would be a MultiIndex:

In [36]: df = df.set_index(['code', 'colour']).sort_index()

In [37]: df
Out[37]: 
              id irrelevant1 irrelevant2 irrelevant3    amount
code  colour                                                  
one   black    1         foo         foo         foo  0.103045
      white    2         foo         foo         foo  0.751824
      white    7         bar         bar         bar -1.275114
three black    6         foo         foo         foo  0.311305
      white    4         bar         bar         bar -0.416722
two   black    5         bar         bar         bar  1.534859
      white    3         foo         foo         foo -1.068399
      white    8         bar         bar         bar -0.243893

[8 rows x 5 columns]

That takes care of 1.
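
If you really want nothing but the id column, here is a minimal sketch of two ways to get it. It assumes a trimmed-down copy of the frame from the question (the irrelevant columns are left out), and the names `ids`, `gb_ids` and `ids_per_group` are only illustrative:

```python
import numpy as np
import pandas as pd

# Trimmed-down copy of the question's frame (assumption: irrelevant columns omitted).
df = pd.DataFrame({
    'id': range(1, 9),
    'code': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
    'colour': ['black', 'white', 'white', 'white',
               'black', 'black', 'white', 'white'],
    'amount': np.random.randn(8),
})

# Option A: keep the MultiIndex, but retain only the id column (a Series).
ids = df.set_index(['code', 'colour']).sort_index()['id']

# Option B: stay with groupby and select the column, which gives a SeriesGroupBy.
gb_ids = df.groupby(['code', 'colour'])['id']

# Collect the ids of each (code, colour) group into a list.
ids_per_group = gb_ids.apply(list)
print(ids_per_group)
```

Option A keeps (code, colour) as the index, so the ids can still be looked up by those keys. A grouped object always needs its keys, so dropping the index entirely only makes sense after you have pulled out the group you want.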

2: Use the familiar slicing syntax:

In [38]: df.loc['one', 'white']
Out[38]: 
             id irrelevant1 irrelevant2 irrelevant3    amount
code colour                                                  
one  white    2         foo         foo         foo  0.751824
     white    7         bar         bar         bar -1.275114

[2 rows x 5 columns]
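
If you would rather stay with the groupby object, the calls in the question fail because get_group takes the group key as one argument; with two grouping columns that key is a tuple. A minimal sketch, again assuming a trimmed-down copy of the question's frame and illustrative variable names:

```python
import pandas as pd

# Trimmed-down copy of the question's frame (assumption: only the needed columns).
df = pd.DataFrame({
    'id': range(1, 9),
    'code': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
    'colour': ['black', 'white', 'white', 'white',
               'black', 'black', 'white', 'white'],
})

gb = df.groupby(['code', 'colour'])

# Pass the key as a single tuple, not as two arguments or a list.
one_white = gb.get_group(('one', 'white'))
one_white_ids = one_white['id'].tolist()

# The same lookup on a sorted MultiIndex, spelling out rows and column explicitly.
indexed = df.set_index(['code', 'colour']).sort_index()
one_white_ids_loc = indexed.loc[('one', 'white'), 'id'].tolist()

print(one_white_ids)
```

Writing the row key as a tuple in .loc also avoids any ambiguity about whether the second value is meant as a row level or a column label.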

3: This is a cross-section; use .xs:

In [39]: df.xs('white', level='colour')
Out[39]: 
       id irrelevant1 irrelevant2 irrelevant3    amount
code                                                   
one     2         foo         foo         foo  0.751824
one     7         bar         bar         bar -1.275114
three   4         bar         bar         bar -0.416722
two     3         foo         foo         foo -1.068399
two     8         bar         bar         bar -0.243893

[5 rows x 5 columns]
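
To pull out just the ids from that cross-section, or to get them without building a MultiIndex at all, something along these lines should work. This is a sketch on a trimmed-down copy of the question's frame, and the variable names are illustrative:

```python
import pandas as pd

# Trimmed-down copy of the question's frame (assumption: only the needed columns).
df = pd.DataFrame({
    'id': range(1, 9),
    'code': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
    'colour': ['black', 'white', 'white', 'white',
               'black', 'black', 'white', 'white'],
})

indexed = df.set_index(['code', 'colour']).sort_index()

# Cross-section on the colour level; the colour level is dropped from the result.
white = indexed.xs('white', level='colour')
white_ids = white['id'].tolist()

# Pass drop_level=False to keep both index levels in the result.
white_full = indexed.xs('white', level='colour', drop_level=False)

# Equivalent selection on the flat frame with a boolean mask.
white_ids_flat = df.loc[df['colour'] == 'white', 'id'].tolist()

print(white_ids, white_ids_flat)
```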

4: Examples are all over the place. Check the pandas / groupby tag here, the section of the docs that is being reworked right now, and the prose docs linked above.
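
As a starting point for exploring grouped objects, here is a short sketch of a few common ways to inspect one. It assumes a trimmed-down copy of the question's frame, and `sizes`, `keys` and `ids_by_key` are illustrative names:

```python
import pandas as pd

# Trimmed-down copy of the question's frame (assumption: only the needed columns).
df = pd.DataFrame({
    'id': range(1, 9),
    'code': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
    'colour': ['black', 'white', 'white', 'white',
               'black', 'black', 'white', 'white'],
})

gb = df.groupby(['code', 'colour'])

# Number of rows in each (code, colour) group.
sizes = gb.size()

# The group keys; with two grouping columns each key is a tuple.
keys = list(gb.groups)

# Iterating over a groupby yields (key, sub-DataFrame) pairs.
ids_by_key = {key: group['id'].tolist() for key, group in gb}

print(sizes)
print(ids_by_key)
```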

TomAugspurger answered Apr 02 '23 03:04