My pandas/numpy is rusty, and the code I have written feels inefficient.
I'm initializing a NumPy array of zeros of length 1000 in Python 3.x. For my purposes, these are simply integers:
import numpy as np
array_of_zeros = np.zeros((1000, ), )
I also have the following DataFrame (which is much smaller than my actual data):
import pandas as pd
dict1 = {'start' : [100, 200, 300], 'end':[400, 500, 600]}
df = pd.DataFrame(dict1)
print(df)
##
##    start  end
## 0    100  400
## 1    200  500
## 2    300  600
The DataFrame has two columns, start and end. Each row represents a range of values, and start will always be a smaller integer than end. Above, we see the first row has the range 100-400, the next is 200-500, and then 300-600.
My goal is to iterate through the pandas DataFrame row by row and increment the NumPy array array_of_zeros based on these index positions. So, if there is a row in the DataFrame of 10 to 20, I would like to add 1 to each element at indices 10 through 20 (inclusive).
Here is the code which does what I would like:
import numpy as np
array_of_zeros = np.zeros((1000, ), )
import pandas as pd
dict1 = {'start' : [100, 200, 300], 'end':[400, 500, 600]}
df = pd.DataFrame(dict1)
print(df)
for idx, row in df.iterrows():
    for i in range(int(row.start), int(row.end)+1):
        array_of_zeros[i]+=1
And it works!
print(array_of_zeros[15])
## output: 0.0
print(array_of_zeros[600])
## output: 1.0
print(array_of_zeros[400])
## output: 3.0
print(array_of_zeros[100])
## output: 1.0
print(array_of_zeros[200])
## output: 2.0
My question: this is very clumsy code! I shouldn't be using so many for-loops with NumPy arrays, and this solution will be very inefficient if the input DataFrame is large.
Is there a more efficient (i.e. more numpy-based) method to avoid this for-loop?
for i in range(int(row.start), int(row.end)+1):
    array_of_zeros[i]+=1
Perhaps there is a pandas-oriented solution?
You can use NumPy array indexing to avoid the inner loop, i.e. res[np.arange(A[i][0], A[i][1]+1)] += 1, but this isn't efficient as it involves creating a new array and using advanced indexing.
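For reference, here is a minimal sketch of that array-indexing variant, assuming A is an integer array with one [start, end] row per range. The function name indexed and the size parameter are illustrative and not from the original post. It removes the inner Python loop but still allocates an index array for every row:

```python
import numpy as np

def indexed(A, size=1000):
    """Increment res over each inclusive [start, end] range via fancy indexing."""
    res = np.zeros(size)
    for i in range(A.shape[0]):
        # np.arange builds a fresh index array for every row; indices within
        # a single row are unique, so += through fancy indexing is safe here.
        res[np.arange(A[i][0], A[i][1] + 1)] += 1
    return res

# Same ranges as the question's DataFrame.
A = np.array([[100, 400], [200, 500], [300, 600]])
res = indexed(A)
print(res[100], res[200], res[400], res[600])  # 1.0 2.0 3.0 1.0
```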
Instead, you can use numba [1] to optimize your algorithm, exactly as it stands. The example below shows a massive performance improvement by moving performance-critical logic to JIT-compiled code.
import numpy as np
from numba import jit

@jit(nopython=True)
def jpp(A):
    # A is an integer array with one (start, end) pair per row
    res = np.zeros(1000)
    for i in range(A.shape[0]):
        for j in range(A[i][0], A[i][1]+1):
            res[j] += 1
    return res
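A short usage sketch for a function of this shape, under a couple of assumptions: the try/except fallback to plain Python when numba is not installed is my addition for portability, and count_ranges and size are illustrative names. The input should be an integer array (for example df[['start', 'end']].values), since its entries are passed straight to range:

```python
import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Assumption for portability: run uncompiled if numba is missing.
    def njit(func):
        return func

@njit
def count_ranges(A, size):
    res = np.zeros(size)
    for i in range(A.shape[0]):
        # Inclusive range: +1 so the end index is also incremented.
        for j in range(A[i, 0], A[i, 1] + 1):
            res[j] += 1
    return res

# Stand-in for df[['start', 'end']].values from the question.
A = np.array([[100, 400], [200, 500], [300, 600]], dtype=np.int64)
res = count_ranges(A, 1000)
print(res[400])  # 3.0
```

Note that with numba installed, the first call includes compilation time, so it is worth calling the function once before timing it.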
Some benchmarking results:
# Python 3.6.0, NumPy 1.11.3
# check result the same
assert (jpp(df[['start', 'end']].values) == original(df)).all()
assert (pir(df) == original(df)).all()
assert (pir2(df) == original(df)).all()
# time results
df = pd.concat([df]*10000)
%timeit jpp(df[['start', 'end']].values) # 64.6 µs per loop
%timeit original(df) # 8.25 s per loop
%timeit pir(df) # 208 ms per loop
%timeit pir2(df) # 1.43 s per loop
Code used for benchmarking:
def original(df):
    array_of_zeros = np.zeros(1000)
    for idx, row in df.iterrows():
        for i in range(int(row.start), int(row.end)+1):
            array_of_zeros[i]+=1
    return array_of_zeros

def pir(df):
    return np.bincount(np.concatenate(
        [np.arange(a, b + 1) for a, b in zip(df.start, df.end)]
    ), minlength=1000)

def pir2(df):
    a = np.zeros((1000,), np.int64)
    for b, c in zip(df.start, df.end):
        np.add.at(a, np.arange(b, c + 1), 1)
    return a
[1] For posterity, I'm including @piRSquared's excellent comment on why numba helps here:
"numba's advantage is looping very efficiently. Though it can understand much of NumPy's API, it is often better to avoid creating NumPy objects within a loop. My code is creating a NumPy array for every row in the dataframe. Then concatenating them prior to using bincount. @jpp's numba code creates very little extra objects and utilizes much of what is already there. The difference between my NumPy solution and @jpp's numba solution is about 4-5 times. Both are linear and should be pretty quick."
numpy.bincount
np.bincount(np.concatenate(
    [np.arange(a, b + 1) for a, b in zip(df.start, df.end)]
), minlength=1000)
numpy.add.at
a = np.zeros((1000,), np.int64)
for b, c in zip(df.start, df.end):
    np.add.at(a, np.arange(b, c + 1), 1)
My solution
for x, y in zip(df.start, df.end):
    array_of_zeros[x:y+1]+=1
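To make the slice-based loop self-contained, and as one possible way to drop the Python loop entirely, here is a sketch that compares it with a difference-array approach. The second function is not from any of the answers above: it is a swapped-in technique that adds +1 at each start and -1 just past each end, then takes a cumulative sum. The names slice_counts, diff_counts, starts, and ends are illustrative; starts and ends stand in for df.start.values and df.end.values, and every index is assumed to lie in [0, size).

```python
import numpy as np

def slice_counts(starts, ends, size=1000):
    """Slice-based loop: one vectorized increment per range."""
    counts = np.zeros(size, dtype=np.int64)
    for x, y in zip(starts, ends):
        counts[x:y + 1] += 1  # inclusive end
    return counts

def diff_counts(starts, ends, size=1000):
    """Difference-array variant (not from the thread): no Python-level loop."""
    # One extra slot so that end + 1 is a valid index when end == size - 1.
    delta = np.zeros(size + 1, dtype=np.int64)
    # np.add.at accumulates correctly even when indices repeat.
    np.add.at(delta, starts, 1)      # a range opens at each start
    np.add.at(delta, ends + 1, -1)   # and closes just past each end
    # The running sum gives how many ranges cover each index.
    return np.cumsum(delta)[:size]

starts = np.array([100, 200, 300])
ends = np.array([400, 500, 600])
a = slice_counts(starts, ends)
b = diff_counts(starts, ends)
print(np.array_equal(a, b))  # True
```

The work in the difference-array version depends on the number of ranges and the array size rather than on how long each range is, though whether it beats the numba loop on real data would need measuring.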