
How to efficiently iterate a pandas DataFrame and increment a NumPy array on these values?

My pandas/numpy is rusty, and the code I have written feels inefficient.

I'm initializing a NumPy array of zeros in Python 3.x, of length 1000. For my purposes, these are simply integers:

import numpy as np
array_of_zeros = np.zeros((1000,))

I also have the following DataFrame (which is much smaller than my actual data):

import pandas as pd
dict1 = {'start' : [100, 200, 300], 'end':[400, 500, 600]}
df = pd.DataFrame(dict1)
print(df)
##
##    start     end
## 0    100     400
## 1    200     500
## 2    300     600

The DataFrame has two columns, start and end. These values represent a range of values, i.e. start will always be a smaller integer than end. Above, we see the first row has the range 100-400, next is 200-500, and then 300-600.

My goal is to iterate through the pandas DataFrame row by row and increment the NumPy array array_of_zeros at those index positions. So, if a row in the DataFrame has the range 10 to 20, I would like to increment by +1 at each of the indices 10 through 20 (inclusive).

Here is the code which does what I would like:

import numpy as np
array_of_zeros = np.zeros((1000,))

import pandas as pd
dict1 = {'start' : [100, 200, 300], 'end':[400, 500, 600]}
df = pd.DataFrame(dict1)
print(df)

for idx, row in df.iterrows():
    for i in range(int(row.start), int(row.end)+1):
        array_of_zeros[i] += 1

And it works!

print(array_of_zeros[15])
## output: 0.0
print(array_of_zeros[600])
## output: 1.0
print(array_of_zeros[400])
## output: 3.0
print(array_of_zeros[100])
## output: 1.0
print(array_of_zeros[200])
## output: 2.0

My question: this is very clumsy code! I shouldn't be using so many for-loops with NumPy arrays, and this solution will be very inefficient if the input DataFrame is large.

Is there a more efficient (i.e. more numpy-based) method to avoid this for-loop?

for i in range(int(row.start), int(row.end)+1):
    array_of_zeros[i] += 1

Perhaps there is a pandas-oriented solution?

ShanZhengYang asked Aug 30 '18


3 Answers

You can use NumPy fancy indexing to avoid the inner loop, i.e. res[np.arange(A[i][0], A[i][1]+1)] += 1, but this isn't efficient: it creates a new index array for every row and uses advanced indexing.
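Spelled out, that per-row fancy-indexing version looks like this (a sketch; I'm assuming A is df[['start', 'end']].values built from the question's sample frame):

```python
import numpy as np
import pandas as pd

# the question's sample frame
df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})
A = df[['start', 'end']].values

res = np.zeros(1000)
for i in range(A.shape[0]):
    # build an index array for this row's range, then bump those slots
    res[np.arange(A[i][0], A[i][1] + 1)] += 1

print(res[400])  # 3.0
```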

Instead, you can use numba[1] to optimize your algorithm, exactly as it stands. The example below shows a massive performance improvement by moving the performance-critical logic to JIT-compiled code.

import numpy as np
from numba import jit

@jit(nopython=True)
def jpp(A):
    res = np.zeros(1000)
    for i in range(A.shape[0]):
        for j in range(A[i][0], A[i][1]+1):
            res[j] += 1
    return res

Some benchmarking results:

# Python 3.6.0, NumPy 1.11.3

# check result the same
assert (jpp(df[['start', 'end']].values) == original(df)).all()
assert (pir(df) == original(df)).all()
assert (pir2(df) == original(df)).all()

# time results
df = pd.concat([df]*10000)

%timeit jpp(df[['start', 'end']].values)  # 64.6 µs per loop
%timeit original(df)                      # 8.25 s per loop
%timeit pir(df)                           # 208 ms per loop
%timeit pir2(df)                          # 1.43 s per loop

Code used for benchmarking:

def original(df):
    array_of_zeros = np.zeros(1000)
    for idx, row in df.iterrows():
        for i in range(int(row.start), int(row.end)+1):
            array_of_zeros[i] += 1
    return array_of_zeros

def pir(df):
    return np.bincount(np.concatenate([np.arange(a, b + 1) for a, b in \
                       zip(df.start, df.end)]), minlength=1000)

def pir2(df):
    a = np.zeros((1000,), np.int64)
    for b, c in zip(df.start, df.end):
        np.add.at(a, np.arange(b, c + 1), 1)
    return a

[1] For posterity, I'm including @piRSquared's excellent comment on why numba helps here:

numba's advantage is looping very efficiently. Though it can understand much of NumPy's API, it is often better to avoid creating NumPy objects within a loop. My code is creating a NumPy array for every row in the dataframe. Then concatenating them prior to using bincount. @jpp's numba code creates very little extra objects and utilizes much of what is already there. The difference between my NumPy solution and @jpp's numba solution is about 4-5 times. Both are linear and should be pretty quick.
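For completeness, one more loop-light option that none of the answers show: the classic difference-array (prefix-sum) trick, which touches only two slots per row and then takes a single cumulative sum, so no per-row index arrays are built at all. A sketch on the question's sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})

# Mark +1 at each start and -1 just past each end; a running sum then
# yields +1 at every position covered by a range. One extra slot holds
# the -1 for ranges that end at index 999.
diff = np.zeros(1001, dtype=np.int64)
np.add.at(diff, df['start'].values, 1)
np.add.at(diff, df['end'].values + 1, -1)
out = np.cumsum(diff)[:1000]

print(out[400])  # 3
```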

jpp answered Nov 14 '22


numpy.bincount

np.bincount(np.concatenate(
    [np.arange(a, b + 1) for a, b in zip(df.start, df.end)]
), minlength=1000)
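A quick runnable check of the bincount approach on the question's sample frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})

# concatenate every row's index range, then count occurrences per index
out = np.bincount(np.concatenate(
    [np.arange(a, b + 1) for a, b in zip(df.start, df.end)]
), minlength=1000)

print(out[200], out[400], out[600])  # 2 3 1
```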

numpy.add.at

a = np.zeros((1000,), np.int64)
for b, c in zip(df.start, df.end):
    np.add.at(a, np.arange(b, c + 1), 1)
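The same check for the np.add.at version; unlike a plain fancy-indexed +=, np.add.at applies the increments unbuffered, so repeated indices would each be counted:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})

a = np.zeros((1000,), np.int64)
for b, c in zip(df.start, df.end):
    # unbuffered in-place add over this row's index range
    np.add.at(a, np.arange(b, c + 1), 1)

print(a[200], a[400])  # 2 3
```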
piRSquared answered Nov 14 '22


My solution

for x, y in zip(df.start, df.end):
    array_of_zeros[x:y + 1] += 1
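Made self-contained with the question's setup: this keeps a Python loop over rows, but each iteration is a single vectorized slice update, so no temporary index arrays are created.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})
array_of_zeros = np.zeros(1000)

# one vectorized slice update per row
for x, y in zip(df.start, df.end):
    array_of_zeros[x:y + 1] += 1

print(array_of_zeros[400])  # 3.0
```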
BENY answered Nov 15 '22