I have a dataset containing the feeding data of three animals: the animal's tag id (Animal1, Animal2, Animal3), the feed type (A, B), and the amount (kg) of feed given at each 'meal':
Animal FeedType Amount(kg)
Animal1 A 10
Animal2 B 7
Animal3 A 4
Animal2 A 2
Animal1 B 5
Animal2 B 6
Animal3 A 2
In base R, I can easily produce the matrix below, which has unique(Animal) as its rows, unique(FeedType) as its columns, and the cumulative Amount (kg) in the corresponding cells, by using tapply() as below:
out <- with(mydf, tapply(Amount, list(Animal, FeedType), sum))
A B
Animal1 10 5
Animal2 2 13
Animal3 6 NA
Is there equivalent functionality for a pandas DataFrame in Python? What is the most elegant and fastest way to achieve this in pandas?
P.S. I want to be able to specify which column, in this case Amount, to perform the aggregation on.
Thanks in advance.
EDIT:
I tried both approaches in the two answers. Performance results with my actual Pandas data-frame of 216,347 rows and 15 columns:
import timeit

start_time1 = timeit.default_timer()
mydf.groupby(['Animal', 'FeedType'])['Amount'].sum()
elapsed_groupby = timeit.default_timer() - start_time1

start_time2 = timeit.default_timer()
# rows=/cols= are the pre-0.14 pivot_table argument names;
# on pandas 0.14 and later, use index=/columns= (see the answers below)
mydf.pivot_table(rows='Animal', cols='FeedType', values='Amount', aggfunc='sum')
elapsed_pivot = timeit.default_timer() - start_time2

print('elapsed_groupby: ' + str(elapsed_groupby))
print('elapsed_pivot: ' + str(elapsed_pivot))
gives:
elapsed_groupby: 10.172213
elapsed_pivot: 8.465783
So in my case, pivot_table() works faster.
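For anyone rerunning this comparison on a current pandas release, here is a minimal self-contained sketch of the same timing harness. It builds a small synthetic frame as a stand-in (my actual 216,347-row data is not included), uses the post-0.14 index=/columns= argument names, and also unstacks the groupby result so both calls produce the same wide layout:

import timeit
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: random animals, feed types, amounts
rng = np.random.default_rng(0)
n = 200_000
mydf = pd.DataFrame({
    'Animal': rng.choice(['Animal1', 'Animal2', 'Animal3'], size=n),
    'FeedType': rng.choice(['A', 'B'], size=n),
    'Amount': rng.integers(1, 11, size=n),
})

start_time1 = timeit.default_timer()
mydf.groupby(['Animal', 'FeedType'])['Amount'].sum().unstack()
elapsed_groupby = timeit.default_timer() - start_time1

start_time2 = timeit.default_timer()
mydf.pivot_table(index='Animal', columns='FeedType', values='Amount', aggfunc='sum')
elapsed_pivot = timeit.default_timer() - start_time2

print('elapsed_groupby:', elapsed_groupby)
print('elapsed_pivot:', elapsed_pivot)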
First I read in your data:
In [6]: import pandas as pd

In [7]: df = pd.read_clipboard(sep=r"\s+", index_col=False)
In [8]: df
Out[8]:
Animal FeedType Amount(kg)
0 Animal1 A 10
1 Animal2 B 7
2 Animal3 A 4
3 Animal2 A 2
4 Animal1 B 5
5 Animal2 B 6
6 Animal3 A 2
Then I can groupby the two columns to aggregate:
In [9]: df.groupby(['Animal','FeedType']).sum()
Out[9]:
Amount(kg)
Animal FeedType
Animal1 A 10
B 5
Animal2 A 2
B 13
Animal3 A 6
To get it in the same format, I can unstack the dataframe:
In [10]: df.groupby(['Animal','FeedType']).sum().unstack()
Out[10]:
Amount(kg)
FeedType A B
Animal
Animal1 10 5
Animal2 2 13
Animal3 6 NaN
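If you don't want to rely on the clipboard, a self-contained version of the same approach looks like this (building the frame from a dict is just for reproducibility; the column names match the question):

import pandas as pd

# Reconstruct the question's data directly instead of via read_clipboard
df = pd.DataFrame({
    'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2',
               'Animal1', 'Animal2', 'Animal3'],
    'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
    'Amount(kg)': [10, 7, 4, 2, 5, 6, 2],
})

# Sum per (Animal, FeedType) pair, then pivot FeedType into columns
wide = df.groupby(['Animal', 'FeedType'])['Amount(kg)'].sum().unstack()
print(wide)

Note that selecting the Amount(kg) column before summing drops the extra column level you see in the output above, leaving just A and B as columns.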
The approach of @Zelazny7 with groupby and unstack is certainly fine, but for completeness, you can also do this directly with pivot_table (see the docs) [version 0.13 and below]:
In [13]: df.pivot_table(rows='Animal', cols='FeedType', values='Amount(kg)', aggfunc='sum')
Out[13]:
FeedType A B
Animal
Animal1 10 5
Animal2 2 13
Animal3 6 NaN
In newer versions of pandas (version 0.14 and later), the arguments of pivot_table have changed:
In [13]: df.pivot_table(index='Animal', columns='FeedType', values='Amount(kg)', aggfunc='sum')
Out[13]:
FeedType A B
Animal
Animal1 10 5
Animal2 2 13
Animal3 6 NaN
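As a side note, if you would rather see 0 than NaN for the missing Animal3/B combination (matching a zero-fill rather than R's NA), pivot_table accepts a fill_value argument; use it only if a zero is actually what you mean:

In [14]: df.pivot_table(index='Animal', columns='FeedType', values='Amount(kg)',
   ....:                aggfunc='sum', fill_value=0)
Out[14]:
FeedType   A   B
Animal
Animal1   10   5
Animal2    2  13
Animal3    6   0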