I use pandas for grouping a dataset. When I aggregate different columns with different functions I'm getting a hierarchical column-structure.
G1 = df.groupby('date').agg({'col1': [sum, np.mean], 'col2': 'sum', 'col3': np.mean})
results in:
             col1            col2       col3
              sum      mean     sum       mean
date
2000-11-01   1701  1.384052   82336  54.222945
2000-11-02  11101  1.447894  761963  70.027260
2000-11-03  11285  1.479418  823355  77.984268
I couldn't find too much about this resulting structure in the docs unfortunately. The only thing I found in pandas docs was the hierarchical multi-index.
How can I access the values?
Currently I do X['col1']['mean'] to access the whole Series:
2000-11-01 1.384052
2000-11-02 1.447894
2000-11-03 1.479418
and thus X['col1']['mean'][1] to get the value 1.447894, but I wonder about the performance, because this code first slices col1 (X['col1']), which results in a view/copy (dunno which one in this case) containing actually 2 columns, and then there is yet another slice of the mean column.
Any tips? And where can I find more about the creation of the hierarchical columns in the docs?
The advice is to do this in one pass, without chained indexing. This matters most for assignment: with chaining you may end up assigning to a temporary copy rather than the original frame, and the modification is lost when that copy is garbage collected.
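To make the assignment point concrete, here is a minimal sketch. The toy data and the name `agg` are made up for illustration and are not the asker's frame.

```python
import pandas as pd

# Hypothetical toy data; the values are not taken from the question.
df = pd.DataFrame({
    "date": ["2000-11-01", "2000-11-01", "2000-11-02", "2000-11-02"],
    "col1": [1.0, 3.0, 2.0, 6.0],
    "col2": [10, 20, 30, 40],
})
agg = df.groupby("date").agg({"col1": ["sum", "mean"], "col2": "sum"})

# Chained form (avoid for assignment): each [] is a separate indexing step,
# so the write may land on a temporary object instead of on agg.
# agg["col1"]["mean"]["2000-11-02"] = 0.0

# One pass: row label and full column tuple in a single .loc call.
agg.loc["2000-11-02", ("col1", "mean")] = 0.0
result = agg.loc["2000-11-02", ("col1", "mean")]
print(result)
```

Because the write goes through a single `.loc` call on `agg` itself, there is no intermediate object that could swallow it.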
Access a MultiIndex* column as a tuple:
In [11]: df[('col1', 'mean')]
Out[11]:
date
2000-11-01 1.384052
2000-11-02 1.447894
2000-11-03 1.479418
Name: (col1, mean), dtype: float64
and a specific value using loc:
In [12]: df.loc['2000-11-01', ('col1', 'mean')]
Out[12]: 1.3840520000000001
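If you want to try the two lookups above on something you can run, the sketch below rebuilds a small aggregated frame. The data and the name `agg` are assumptions for illustration, so the numbers differ from the question's.

```python
import pandas as pd

# Hypothetical toy data standing in for the asker's dataset.
df = pd.DataFrame({
    "date": ["2000-11-01", "2000-11-01", "2000-11-02", "2000-11-02"],
    "col1": [1.0, 3.0, 2.0, 6.0],
    "col2": [10, 20, 30, 40],
})
agg = df.groupby("date").agg({"col1": ["sum", "mean"], "col2": "sum"})

# The columns are a MultiIndex whose entries are (column, function) tuples.
column_keys = agg.columns.tolist()
print(column_keys)

# A full tuple selects one column as a Series ...
mean_series = agg[("col1", "mean")]

# ... and .loc with a row label plus the tuple returns a single value.
value = agg.loc["2000-11-02", ("col1", "mean")]
print(value)
```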
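(To mix a label with a position you once had to use ix, but ix was deprecated in pandas 0.20 and removed in 1.0. Look up the column's integer position with get_loc and pass it to iloc instead.)
In [13]: df.iloc[0, df.columns.get_loc(('col1', 'mean'))]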
Out[13]: 1.3840520000000001
*This is a MultiIndex.
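On recent pandas versions, where `.ix` is no longer available, a positional row can still be combined with a labelled column. The sketch below shows two ways under the same made-up toy data as an assumption (`agg` and `col_pos` are illustrative names).

```python
import pandas as pd

# Hypothetical toy data; not the asker's values.
df = pd.DataFrame({
    "date": ["2000-11-01", "2000-11-01", "2000-11-02", "2000-11-02"],
    "col1": [1.0, 3.0, 2.0, 6.0],
    "col2": [10, 20, 30, 40],
})
agg = df.groupby("date").agg({"col1": ["sum", "mean"], "col2": "sum"})

# Option 1: translate the column tuple to its integer position, then use iloc.
col_pos = agg.columns.get_loc(("col1", "mean"))
first_by_iloc = agg.iloc[0, col_pos]

# Option 2: select the column by label, then take the row by position.
# This is chained, so it is fine for reading but not for assignment.
first_by_series = agg[("col1", "mean")].iloc[0]

print(first_by_iloc, first_by_series)
```

Option 1 stays a single indexing call on the frame, which is the form to prefer if you ever need to assign.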