I have a DataFrame with three columns: nums, which holds the values to work with; b, which is always either 1 or 0; and result, which is currently zero everywhere except in the first row (because we need an initial value to work with). The DataFrame looks like this:
nums b result
0 20.0 1 20.0
1 22.0 0 0
2 30.0 1 0
3 29.1 1 0
4 20.0 0 0
...
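For reproducibility, a 5-row version of the frame above can be built like this (a sketch; the real data comes from large files):

```python
import pandas as pd

# Build the sample frame shown above; only the first row of
# 'result' holds the initial value, the rest start at zero.
df = pd.DataFrame({
    'nums': [20.0, 22.0, 30.0, 29.1, 20.0],
    'b':    [1, 0, 1, 1, 0],
})
df['result'] = 0.0
df.loc[0, 'result'] = df.loc[0, 'nums']
print(df)
```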
I'd like to go over each row of the DataFrame starting with the second row, do some calculation, and store the result in the result column. Since I'm working with large files, I need this operation to be fast, which is why I want something like apply.
The calculation I want to do is to take the values of nums and result from the previous row, and if b in the current row is 0, then (for example) add the nums and result from that previous row. If b in that row is 1, I'd like to subtract them, for example.
I tried using apply, but I couldn't access the previous row, and sadly it seems that even if I do manage to access the previous row, the DataFrame won't update the result column until the end.
I also tried using a loop like the one below, but it's too slow for the large files I'm working with:
for i in range(1, len(df.index)):
    row = df.index[i]
    prev_row = df.index[i - 1]  # index of the previous row, for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[prev_row, 'result'],
                                           prev_num=df.loc[prev_row, 'nums'],
                                           current_b=df.loc[row, 'b'])
some_calc_func looks like this (just a general example):
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17
Please answer with respect to some_calc_func.
If you want to keep the function some_calc_func and not use another library, you should avoid accessing each element individually at every iteration. You can use zip on the columns nums and b, shifted by one row relative to each other (since you access nums from the previous row), and keep prev_res in memory at each iteration. Also, append to a list instead of to the DataFrame, and assign the list to the column after the loop.
prev_res = df.loc[0, 'result']  # get the first result
l_res = [prev_res]  # initialize the list of results
# loop with zip to get both values at the same time;
# use loc to start b at the second row, but not nums
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    # use your function to calculate the new prev_res
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    # add it to the list of results
    l_res.append(prev_res)
# assign the list to the column
df['result'] = l_res
print(df)  # same result as with your method
nums b result
0 20.0 1 20.0
1 22.0 0 37.0
2 30.0 1 407.0
3 29.1 1 6105.0
4 20.0 0 46.1
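As a sanity check, the first couple of rows can be verified by hand with the function from the question: in row 1, b is 0, so the rule prev_num + 17 applies to the previous row's nums (20.0), and in row 2, b is 1, so the previous result and nums are multiplied and halved:

```python
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

# row 1: previous nums is 20.0, current b is 0 -> 20.0 + 17 = 37.0
assert some_calc_func(20.0, 20.0, 0) == 37.0
# row 2: previous result is 37.0, previous nums is 22.0, current b is 1
assert some_calc_func(37.0, 22.0, 1) == 37.0 * 22.0 / 2  # 407.0
```

These match the 37.0 and 407.0 shown in the output above.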
Now, with a DataFrame df of 5000 rows, I got:
%%timeit
prev_res = df.loc[0, 'result']
l_res = [prev_res]
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    l_res.append(prev_res)
df['result'] = l_res
# 4.42 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
while with your original solution, it was ~750x slower:
%%timeit
for i in range(1, len(df.index)):
    row = df.index[i]
    prev_row = df.index[i - 1]  # index of the previous row, for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[prev_row, 'result'],
                                           prev_num=df.loc[prev_row, 'nums'],
                                           current_b=df.loc[row, 'b'])
# 3.25 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
EDIT: with another library called numba, if the function some_calc_func can easily be used with a Numba decorator:
import numpy as np
from numba import jit

# decorate your function
@jit
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17
# create a function to do the job;
# Numba likes NumPy arrays
@jit
def with_numba(prev_res, arr_nums, arr_b):
    # array for the results, with the initial value set
    arr_res = np.zeros_like(arr_nums)
    arr_res[0] = prev_res
    # loop over the length of arr_b
    for i in range(len(arr_b)):
        # do the calculation and set the value in the result array
        prev_res = some_calc_func(prev_res, arr_nums[i], arr_b[i])
        arr_res[i + 1] = prev_res
    return arr_res
Finally, call it like this:
df['result'] = with_numba(df.loc[0, 'result'],
                          df['nums'].to_numpy(),
                          df.loc[1:, 'b'].to_numpy())
With a timeit, I get another ~9x speedup over my method with zip, and the speedup could increase with the size of the data:
%timeit df['result'] = with_numba(df.loc[0, 'result'],
                                  df['nums'].to_numpy(),
                                  df.loc[1:, 'b'].to_numpy())
# 526 µs ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that using Numba might be problematic depending on your actual some_calc_func.
IIUC:
>>> df['result'] = (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums
).fillna(df.result).cumsum()
>>> df
nums b result
0 20.0 1 20.0
1 22.0 0 42.0
2 30.0 1 12.0
3 29.1 1 -17.1
4 20.0 0 2.9
Explanation:
# replace 0 with 1 and 1 with -1 in column `b` for rows where result==0
>>> df[df.result.eq(0)].b.replace({0: 1, 1: -1})
1 1
2 -1
3 -1
4 1
Name: b, dtype: int64
# multiply with nums
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums)
0 NaN
1 22.0
2 -30.0
3 -29.1
4 20.0
dtype: float64
# fill the 'NaN' with the corresponding value from df.result (which is 20 here)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result)
0 20.0
1 22.0
2 -30.0
3 -29.1
4 20.0
dtype: float64
# take the cumulative sum (cumsum)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result).cumsum()
0 20.0
1 42.0
2 12.0
3 -17.1
4 2.9
dtype: float64
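The replace-based sign flip above can equivalently be sketched with numpy.where, which some may find more direct; using the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'nums': [20.0, 22.0, 30.0, 29.1, 20.0],
                   'b':    [1, 0, 1, 1, 0]})
df['result'] = 0.0
df.loc[0, 'result'] = 20.0

# signed contribution of each row: +nums when b == 0, -nums when b == 1
signed = np.where(df['b'].eq(0), df['nums'], -df['nums'])
signed[0] = df.loc[0, 'result']  # row 0 keeps its initial value
df['result'] = signed.cumsum()
```

This produces the same result column (20.0, 42.0, 12.0, -17.1, 2.9) as the replace/fillna/cumsum chain.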
According to your requirement in the comments, I cannot think of a way without loops:
c1, c2 = 2, 1
l = [df.loc[0, 'result']]  # store the first result in a list
# then loop over the series (df.b * df.nums)
for i, val in (df.b * df.nums).items():
    if i:  # skip the 0th index
        if val == 0:  # (df.b * df.nums) == 0 if df.b == 0
            l.append(l[-1])  # repeat the last result
        else:  # otherwise apply the rule
            t = l[-1] * c2 + val * c1
            l.append(t)
>>> l
[20.0, 20.0, 80.0, 138.2, 138.2]
>>> df['result'] = l
nums b result
0 20.0 1 20.0
1 22.0 0 20.0
2 30.0 1 80.0 # [ 20 * 1 + 30 * 2]
3 29.1 1 138.2 # [ 80 * 1 + 29.1 * 2]
4 20.0 0 138.2
It seems fast enough; I did not test it on a large sample.