I recently learned about pandas
and was happy to see its analytics functionality. I am trying to convert Excel array functions into the Pandas equivalent to automate spreadsheets that I have created for the creation of performance attribution reports. In this example, I created a new column in Excel based on conditions within other columns:
={SUMIFS($F$10:$F$4518,$A$10:$A$4518,$C$4,$B$10:$B$4518,0,$C$10:$C$4518," ",$D$10:$D$4518,$D10,$E$10:$E$4518,$E10)}
The formula is summing up the values in the "F" array (security weights) based on certain conditions. "A" array (portfolio ID) is a certain number, "B" array (security id) is zero, "C" array (group description) is " ", "D" array (start date) is the date of the row that I am on, and "E" array (end date) is the date of the row that I am on.
In Pandas, I am using the DataFrame. Creating a new column on a dataframe with the first three conditions is straight forward, but I am having difficult with the last two conditions.
reportAggregateDF['PORT_WEIGHT'] = reportAggregateDF['SEC_WEIGHT_RATE']
[(reportAggregateDF['PORT_ID'] == portID) &
(reportAggregateDF['SEC_ID'] == 0) &
(reportAggregateDF['GROUP_LIST'] == " ") &
(reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].ix[:]) &
(reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].ix[:])].sum()
Obviously the .ix[:] in the last two conditions is not doing anything for me, but is there a way to make the sum conditional on the row that I am on without looping? My goal is to not do any loops, but instead use purely vector operations.
With SUMPRODUCT In spirit, the SUMPRODUCT option is closest to the SUMIFS formula since we are summing values based on multiple criteria. As before, it works fine as long as there is only one matching result.
Difference Between Excel SUMIF and Excel SUMIFS Functions Both are logical SUM functions that find the SUM of a given range based on specified conditions. The only difference between Excel SUMIFS & SUMIF functions is that SUMIFs can check for multiple criteria at once, while SUMIF can check for one criterion at a time.
SUMIFS is a challenger quickly gaining popularity with Excel users. SUMIFS can do much of what VLOOKUP can do … but better :-) That is, SUMIFS makes a great alternative to VLOOKUP, and here's why: 1. SUMIFS matches equivalent values when stored as different data types.
You want to use the apply function and a lambda:
>> df
A B C D E
0 mitfx 0 200 300 0.25
1 gs 1 150 320 0.35
2 duk 1 5 2 0.45
3 bmo 1 145 65 0.65
Let's say I want to sum column C times E but only if column B == 1 and D is greater than 5:
df['matches'] = df.apply(lambda x: x['C'] * x['E'] if x['B'] == 1 and x['D'] > 5 else 0, axis=1)
df.matches.sum()
It might be cleaner to split this into two steps:
df_subset = df[(df.B == 1) & (df.D > 5)]
df_subset.apply(lambda x: x.C * x.E, axis=1).sum()
or to use simply multiplication for speed:
df_subset = df[(df.B == 1) & (df.D > 5)]
print sum(df_subset.C * df_subset.E)
You are absolutely right to want to do this problem without loops.
I'm sure there is a better way, but this did it in a loop:
for idx, eachRecord in reportAggregateDF.T.iteritems():
reportAggregateDF['PORT_WEIGHT'].ix[idx] = reportAggregateDF['SEC_WEIGHT_RATE'][(reportAggregateDF['PORT_ID'] == portID) &
(reportAggregateDF['SEC_ID'] == 0) &
(reportAggregateDF['GROUP_LIST'] == " ") &
(reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].ix[idx]) &
(reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].ix[idx])].sum()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With