I admit that I am not a Python guru, but still I find dealing with Pandas DataFrameGroupBy
and SeriesGroupBy
objects exceptionally counter-intuitive. ( I have an R background.)
I have the dataframe below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id' : range(1,9),
'code' : ['one', 'one', 'two', 'three',
'two', 'three', 'one', 'two'],
'colour': ['black', 'white','white','white',
'black', 'black', 'white', 'white'],
'irrelevant1': ['foo', 'foo', 'foo','bar','bar',
'foo','bar','bar'],
'irrelevant2': ['foo', 'foo', 'foo','bar','bar',
'foo','bar','bar'],
'irrelevant3': ['foo', 'foo', 'foo','bar','bar',
'foo','bar','bar'],
'amount' : np.random.randn(8)}, columns= ['id','code','colour', 'irrelevant1', 'irrelevant2', 'irrelevant3', 'amount'])
I want to be able to get the id
's grouped by code
and colour
. The code below does the grouping but keeps all columns.
gb = df.groupby(['code','colour'])
gb.head(5)
id code colour irrelevant1 irrelevant2 irrelevant3 amount
code colour
one black 0 1 one black foo foo foo -0.644170
white 1 2 one white foo foo foo 0.912372
6 7 one white bar bar bar 0.530575
three black 5 6 three black foo foo foo -0.123806
white 3 4 three white bar bar bar -0.387080
two black 4 5 two black bar bar bar -0.578107
white 2 3 two white foo foo foo 0.768637
7 8 two white bar bar bar -0.282577
Questions:
1) In gb
, how do I only store the id
column (and not even any index) and get rid of the rest?
2) Once I have the desired DataFrameGroupBy
gb
, how do I access the id
s of cases where {code = one and colour=white} ? I tried gb.get_group('one','white')
and gb.get_group(['one','white'])
but they do not work.
3) How do I access entries where {colour=white}, i.e. lacking the code
index ?
4) Finally, the manual is not very helpful, do you know of any sources where there are examples of how to create and access these grouped objects?
For your problem, you don't even need to perform a groupby
(but you should read more about it in the prose docs.
A better solution would be a MultiIndex
:
In [36]: df = df.set_index(['code', 'colour']).sort_index()
In [37]: df
Out[37]:
id irrelevant1 irrelevant2 irrelevant3 amount
code colour
one black 1 foo foo foo 0.103045
white 2 foo foo foo 0.751824
white 7 bar bar bar -1.275114
three black 6 foo foo foo 0.311305
white 4 bar bar bar -0.416722
two black 5 bar bar bar 1.534859
white 3 foo foo foo -1.068399
white 8 bar bar bar -0.243893
[8 rows x 5 columns]
That takes care of 1.
2: Use the familiar slicing syntax:
In [38]: df.loc['one', 'white']
Out[38]:
id irrelevant1 irrelevant2 irrelevant3 amount
code colour
one white 2 foo foo foo 0.751824
white 7 bar bar bar -1.275114
[2 rows x 5 columns]
3: This is a cross-section, use .xs
:
In [39]: df.xs('white', level='colour')
Out[39]:
id irrelevant1 irrelevant2 irrelevant3 amount
code
one 2 foo foo foo 0.751824
one 7 bar bar bar -1.275114
three 4 bar bar bar -0.416722
two 3 foo foo foo -1.068399
two 8 bar bar bar -0.243893
[5 rows x 5 columns]
4: Examples are all over the place. Check the pandas / groupby tag here, this section of the docs is being worked on right now, the prose docs linked above.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With