From a Pandas newbie: I have data that looks essentially like this -
import pandas as pd

data1 = pd.DataFrame(
    {'Dir': ['E','E','W','W','E','W','W','E'],
     'Bool': ['Y','N','Y','N','Y','N','Y','N'],
     'Data': [4,5,6,7,8,9,10,11]},
    index=pd.DatetimeIndex(['12/30/2000','12/30/2000','12/30/2000','1/2/2001',
                            '1/3/2001','1/3/2001','12/30/2000','12/30/2000']))
data1
Out[1]:
           Bool  Data Dir
2000-12-30    Y     4   E
2000-12-30    N     5   E
2000-12-30    Y     6   W
2001-01-02    N     7   W
2001-01-03    Y     8   E
2001-01-03    N     9   W
2000-12-30    Y    10   W
2000-12-30    N    11   E
And I want to group it by multiple levels, then do a cumsum():
E.g., something like running_sum = data1.groupby(['Bool','Dir']).cumsum()  <- (doesn't work)
with output that would look something like:
Bool Dir Date        running_sum
N    E   2000-12-30           16
     W   2001-01-02            7
         2001-01-03           16
Y    E   2000-12-30            4
         2001-01-03           12
     W   2000-12-30           16
My "like" code is clearly not even close. I have made a number of attempts and learned many new things about how not to do this.
Thanks for any help you can give.
groupby() can accept several different kinds of grouping key: a column name or list of column names; a dict or pandas Series; a NumPy array or pandas Index; or an array-like iterable of these.
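As a rough illustration of a few of those key types, here is a small sketch built on the frame from the question. The dates are written as ISO strings for clarity, `size_label` is a made-up helper array for this example, and the dict form is left out for brevity.

```python
import numpy as np
import pandas as pd

# Same data as the question, with ISO date strings.
data1 = pd.DataFrame(
    {
        "Dir": ["E", "E", "W", "W", "E", "W", "W", "E"],
        "Bool": ["Y", "N", "Y", "N", "Y", "N", "Y", "N"],
        "Data": [4, 5, 6, 7, 8, 9, 10, 11],
    },
    index=pd.to_datetime(
        ["2000-12-30", "2000-12-30", "2000-12-30", "2001-01-02",
         "2001-01-03", "2001-01-03", "2000-12-30", "2000-12-30"]
    ),
)

# 1. Group by a single column name.
by_column = data1.groupby("Dir")["Data"].sum()

# 2. Group by a NumPy array with one label per row (hypothetical labels).
size_label = np.where(data1["Data"] > 7, "big", "small")
by_array = data1.groupby(size_label)["Data"].sum()

# 3. Group by a pandas Index, here the frame's own date index.
by_index = data1.groupby(data1.index)["Data"].sum()

print(by_column, by_array, by_index, sep="\n\n")
```

In each case the key only has to supply one group label per row; where the labels come from is up to you.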
The cumsum() method goes through the values in the DataFrame from the top, row by row, adding each value to the running total of the rows above it. The result is a DataFrame of the same shape whose last row contains the sum of all values for each column.
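A minimal sketch of that behaviour, using a small made-up frame (the column names `a` and `b` are only for illustration):

```python
import pandas as pd

# Toy frame, not from the question.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Running total down each column; the shape is unchanged.
running = df.cumsum()
print(running)
```

The last row of `running` holds the same values as `df.sum()`.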
groupby() can take a list of columns to group by several columns at once, and the grouped result can then be passed to one or more aggregation functions in a single call.
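For example, here is a sketch that groups the question's frame by both Bool and Dir and applies two aggregations at once (ISO date strings are used here for clarity, and `totals` is just an illustrative name):

```python
import pandas as pd

# Same data as the question, with ISO date strings.
data1 = pd.DataFrame(
    {
        "Dir": ["E", "E", "W", "W", "E", "W", "W", "E"],
        "Bool": ["Y", "N", "Y", "N", "Y", "N", "Y", "N"],
        "Data": [4, 5, 6, 7, 8, 9, 10, 11],
    },
    index=pd.to_datetime(
        ["2000-12-30", "2000-12-30", "2000-12-30", "2001-01-02",
         "2001-01-03", "2001-01-03", "2000-12-30", "2000-12-30"]
    ),
)

# One row per (Bool, Dir) pair, one column per aggregation.
totals = data1.groupby(["Bool", "Dir"])["Data"].agg(["sum", "mean"])
print(totals)
```

The result has four rows, one for each Bool/Dir combination that appears in the data.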
Try this:
data2 = data1.reset_index()
data3 = data2.set_index(["Bool", "Dir", "index"]) # index is the new column created by reset_index
running_sum = data3.groupby(level=[0,1,2]).sum().groupby(level=[0,1]).cumsum()
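The three steps above can also be run end to end. The sketch below is one way to do that. It assumes ISO date strings in place of the original ones, names the date level `Date` (the original leaves it as the default `index`), and groups on columns rather than index levels; `daily` is just an illustrative variable name.

```python
import pandas as pd

# Same data as the question, with ISO date strings.
data1 = pd.DataFrame(
    {
        "Dir": ["E", "E", "W", "W", "E", "W", "W", "E"],
        "Bool": ["Y", "N", "Y", "N", "Y", "N", "Y", "N"],
        "Data": [4, 5, 6, 7, 8, 9, 10, 11],
    },
    index=pd.to_datetime(
        ["2000-12-30", "2000-12-30", "2000-12-30", "2001-01-02",
         "2001-01-03", "2001-01-03", "2000-12-30", "2000-12-30"]
    ),
)

# Step 1: collapse duplicate (Bool, Dir, Date) rows into one total per date.
daily = (
    data1.rename_axis("Date")  # give the date index a name
    .reset_index()             # turn it into a regular column
    .groupby(["Bool", "Dir", "Date"])["Data"]
    .sum()
)

# Step 2: running total over dates within each (Bool, Dir) pair.
running_sum = daily.groupby(level=["Bool", "Dir"]).cumsum().rename("running_sum")
print(running_sum)
```

The six values printed are 16, 7, 16, 4, 12, 16, in the same order as the table the question asks for.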
The reason you cannot simply use cumsum on data3 has to do with how your data is structured. Grouping by Bool and Dir and applying an aggregation function (sum, mean, etc.) would produce a DataFrame of a smaller size than you started with, as whatever function you used would aggregate values based on your group keys. However, cumsum is not an aggregation function. It will return a DataFrame that is the same size as the one it's called with. So unless your input DataFrame is in a format where the output can be the same size after calling cumsum, it will throw an error. That's why I called sum first, which returns a DataFrame in the correct input format.
Sorry if I haven't explained this well enough. Maybe someone else could help me out?