
How do you apply a function on a dataframe column using data from previous rows?

I have a DataFrame with three columns: nums, which holds the values to work with; b, which is always either 1 or 0; and result, which is currently zero everywhere except in the first row (because we need an initial value to work with). The DataFrame looks like this:

   nums   b    result
0  20.0  1    20.0
1  22.0  0    0
2  30.0  1    0
3  29.1  1    0
4  20.0  0    0
...

The Problem

I'd like to go over each row in the dataframe starting with the second row, do some calculation and store the result in the result column. Since I'm working with large files, I need a way to make this operation fast so that's why I want something like apply.

The calculation takes the values of nums and result from the previous row. If b in the current row is 0, I want (for example) to add the previous row's nums and result; if b in the current row is 1, I'd like to subtract them instead.

What have I tried?

I tried using apply, but I couldn't access the previous row. Sadly, it also seems that even if I do manage to access the previous row, the DataFrame won't update the result column until the end.
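A minimal sketch of the limitation described above, assuming the five sample rows from the table: shift can expose the previous row's nums, but the previous row's result is still the initial column, because nothing has been computed yet.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "nums": [20.0, 22.0, 30.0, 29.1, 20.0],
    "b": [1, 0, 1, 1, 0],
    "result": [20.0, 0.0, 0.0, 0.0, 0.0],
})

# shift(1) aligns each row with the value from the row above it.
prev_num = df["nums"].shift(1)
prev_result = df["result"].shift(1)

print(prev_num.tolist())     # [nan, 20.0, 22.0, 30.0, 29.1]
print(prev_result.tolist())  # [nan, 20.0, 0.0, 0.0, 0.0]
```

Only the second row sees a real previous result; every later row sees the placeholder 0, which is why a row-wise apply or a shifted column is not enough for a calculation that feeds on its own output.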

I also tried using a loop like so, but it's too slow for the large files I'm working with:

for i in range(1, len(df.index)):
    row = df.index[i]
    prev_row = df.index[i - 1]  # index label of the previous row, for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[prev_row, 'result'],
                                           prev_num=df.loc[prev_row, 'nums'],
                                           current_b=df.loc[row, 'b'])

some_calc_func looks like this (just a general example):

def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

Please answer with respect to some_calc_func

snatchysquid asked Jun 03 '20 10:06


2 Answers

If you want to keep some_calc_func and not use another library, avoid accessing individual DataFrame elements on every iteration. Instead, zip the nums and b columns with an offset of one row between them (since you need nums from the previous row), and carry prev_res along in a variable from one iteration to the next. Also, append each result to a list rather than writing it into the DataFrame, and assign the list to the column after the loop.

prev_res = df.loc[0, 'result']  # get the first result
l_res = [prev_res]  # initialize the list of results
# loop with zip to get both values at the same time;
# slice b from the second row but keep nums from the first row,
# so each b is paired with the nums of the previous row
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    # use your function to calculate the new prev_res
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    # add to the list of results
    l_res.append(prev_res)
# assign to the column
df['result'] = l_res
print(df)  # same result as with your method
   nums  b  result
0  20.0  1    20.0
1  22.0  0    37.0
2  30.0  1   407.0
3  29.1  1  6105.0
4  20.0  0    46.1

Now with a dataframe df of 5000 rows, I got:

%%timeit
prev_res = df.loc[0, 'result']
l_res = [prev_res]
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    l_res.append(prev_res)
df['result'] = l_res
# 4.42 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With your original solution, it was roughly 735x slower (3.25 s vs. 4.42 ms):

%%timeit 
for i in range(1, len(df.index)):
    row = df.index[i]
    prev_row = df.index[i - 1]  # index label of the previous row, for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[prev_row, 'result'],
                                           prev_num=df.loc[prev_row, 'nums'],
                                           current_b=df.loc[row, 'b'])
#3.25 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
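The 5000-row frame used for these timings is not shown. The following is one way to build a comparable frame and run the zip-based loop on it; the random data, the value range, the seed and the helper name fill_result are all assumptions made for illustration, so absolute timings on such a frame will differ from the numbers above.

```python
import numpy as np
import pandas as pd

def some_calc_func(prev_result, prev_num, current_b):
    # Same example function as in the question.
    if current_b == 1:
        return prev_result * prev_num / 2
    return prev_num + 17

def fill_result(df):
    """Zip-based loop from this answer, wrapped in a (hypothetical) helper."""
    prev_res = df["result"].iloc[0]
    results = [prev_res]
    # Pair each row's b (from the second row on) with the previous row's nums.
    for prev_num, current_b in zip(df["nums"], df["b"].iloc[1:]):
        prev_res = some_calc_func(prev_res, prev_num, current_b)
        results.append(prev_res)
    return results

# Assumed test data: 5000 random rows, seed chosen arbitrarily.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "nums": rng.uniform(1.0, 30.0, size=n),
    "b": rng.integers(0, 2, size=n),
    "result": 0.0,
})
df.loc[0, "result"] = 20.0  # initial value, as in the question

df["result"] = fill_result(df)
```

In IPython or Jupyter, `%timeit fill_result(df)` can then be used to time the loop on your own machine.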

EDIT: Another option is the Numba library, if some_calc_func can easily be compiled with Numba's jit decorator.

import numpy as np
from numba import jit
# decorate your function
@jit
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

# create a function to do your job
# numba likes numpy arrays
@jit
def with_numba(prev_res, arr_nums, arr_b):
    # array for results and initialize
    arr_res = np.zeros_like(arr_nums)
    arr_res[0] = prev_res
    # loop on the length of arr_b
    for i in range(len(arr_b)):
        #do the calculation and set the value in result array
        prev_res = some_calc_func (prev_res, arr_nums[i], arr_b[i])
        arr_res[i+1] = prev_res
    return arr_res

Finally, call it like this:

df['result'] = with_numba(df.loc[0, 'result'], 
                          df['nums'].to_numpy(),  
                          df.loc[1:, 'b'].to_numpy())

With timeit, this is roughly another 8x faster than my zip method (526 µs vs. 4.42 ms), and the speedup could increase with the size of the data:

%timeit df['result'] = with_numba(df.loc[0, 'result'], 
                                  df['nums'].to_numpy(),  
                                  df.loc[1:, 'b'].to_numpy()) 
# 526 µs ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that using Numba might be problematic depending on your actual some_calc_func, since Numba can compile only a subset of Python and NumPy.
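If the real some_calc_func turns out not to be compilable, a plain-Python fallback is still available. The sketch below is not part of the benchmarks above; it swaps the explicit loop for itertools.accumulate from the standard library (the initial argument needs Python 3.8 or later) and assumes the five sample rows from the question.

```python
from itertools import accumulate

import pandas as pd

def some_calc_func(prev_result, prev_num, current_b):
    # Same example function as in the question.
    if current_b == 1:
        return prev_result * prev_num / 2
    return prev_num + 17

df = pd.DataFrame({
    "nums": [20.0, 22.0, 30.0, 29.1, 20.0],
    "b": [1, 0, 1, 1, 0],
    "result": [20.0, 0.0, 0.0, 0.0, 0.0],
})

# Pair the previous row's nums with the current row's b.
pairs = zip(df["nums"].tolist(), df["b"].tolist()[1:])

# accumulate feeds each output back in as the next prev_result.
results = list(accumulate(
    pairs,
    lambda prev_res, pair: some_calc_func(prev_res, pair[0], pair[1]),
    initial=df.loc[0, "result"],
))

df["result"] = results
print(df)
```

This does the same per-row Python call as the zip loop, so it should be expected to perform similarly rather than to match Numba.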

Ben.T answered Oct 12 '22 23:10


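If I understand correctly (adding the current row's nums to the running result where b is 0, and subtracting it where b is 1):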

>>> df['result'] = (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums
                    ).fillna(df.result).cumsum()

>>> df
   nums  b  result
0  20.0  1    20.0
1  22.0  0    42.0
2  30.0  1    12.0
3  29.1  1   -17.1
4  20.0  0     2.9

Explanation:

# replace 0 with 1 and 1 with -1 in column `b` for rows where result==0
>>> df[df.result.eq(0)].b.replace({0: 1, 1: -1})
1    1
2   -1
3   -1
4    1
Name: b, dtype: int64

# multiply with nums
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums)
0     NaN
1    22.0
2   -30.0
3   -29.1
4    20.0
dtype: float64

# fill the 'NaN' with the corresponding value from df.result (which is 20 here)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result)
0    20.0
1    22.0
2   -30.0
3   -29.1
4    20.0
dtype: float64

# take the cumulative sum (cumsum)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result).cumsum()
0    20.0
1    42.0
2    12.0
3   -17.1
4     2.9
dtype: float64
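The steps above can also be written as a small self-contained script. This is a sketch that uses np.where instead of replace, and it seeds the first row by position rather than by testing result == 0, which is an assumption that only holds when the initial value sits in the first row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "nums": [20.0, 22.0, 30.0, 29.1, 20.0],
    "b": [1, 0, 1, 1, 0],
    "result": [20.0, 0.0, 0.0, 0.0, 0.0],
})

# +1 where b == 0 (add nums), -1 where b == 1 (subtract nums).
sign = np.where(df["b"].to_numpy() == 0, 1.0, -1.0)

# Signed step contributed by each row.
steps = sign * df["nums"].to_numpy()

# The first row contributes its initial result instead of a step.
steps[0] = df.loc[0, "result"]

# A running total of the steps gives the result column.
df["result"] = np.cumsum(steps)
print(df)
```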

Given the requirement in your comments, I cannot think of a way to do this without a loop:

c1, c2 = 2, 1
l = [df.loc[0, 'result']]            # store the first result in a list

# then loop over the series (df.b * df.nums)

for i, val in (df.b * df.nums).items():   # Series.iteritems() was removed in pandas 2.0
    if i:                            # except for 0th index
        if val == 0:                 # (df.b * df.nums) == 0 if df.b == 0
            l.append(l[-1])          # append the last result
        else:                        # otherwise apply the rule
            t = l[-1] *c2 + val * c1
            l.append(t)

>>> l
[20.0, 20.0, 80.0, 138.2, 138.2]

>>> df['result'] = l

>>> df
   nums  b  result
0  20.0  1    20.0
1  22.0  0    20.0
2  30.0  1    80.0   # [ 20 * 1 +   30 * 2]
3  29.1  1   138.2   # [ 80 * 1 + 29.1 * 2]
4  20.0  0   138.2

This seems fast enough, though I did not test it on a large sample.
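For the specific constants used here (c2 == 1), the rule collapses to a running sum: every row adds c1 * b * nums to the previous result, and that term is zero whenever b is 0. A sketch under that assumption follows; it does not generalize to other values of c2, where the previous result is rescaled at each step.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "nums": [20.0, 22.0, 30.0, 29.1, 20.0],
    "b": [1, 0, 1, 1, 0],
    "result": [20.0, 0.0, 0.0, 0.0, 0.0],
})

c1 = 2  # same c1 as in the loop; c2 is assumed to be 1

# Increment contributed by each row: c1 * nums where b == 1, else 0.
# np.array makes a writable copy of the values.
increments = np.array(c1 * df["b"] * df["nums"], dtype=float)

# The first row holds the initial result rather than an increment.
increments[0] = df.loc[0, "result"]

df["result"] = np.cumsum(increments)
print(df)
```

On the sample data this reproduces the list built by the loop, up to floating-point rounding.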

Sayandip Dutta answered Oct 12 '22 23:10