I have a dataframe df
with a very long column of random positive integers:
df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})
I want to determine the index of the first even number in the column. One way to do this is:
df[df.n % 2 == 0].iloc[0]
but this involves a lot of operations (build the boolean mask df.n % 2 == 0, index df with that mask, and finally take the first item) and is very slow. A loop like this is much quicker:
for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
not least because the first match will probably be within the first few rows. Is there any pandas method that does this with similar performance? Thank you.
NOTE: This condition (being an even number) is just an example; I'm looking for a solution that works for any kind of condition on the values, i.e., a fast one-line alternative to:
df[ conditions on df.n ].iloc[0]
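To make the question concrete, what I am hoping pandas provides natively is something like the following helper (a sketch only; first_index_where is a hypothetical function, not a pandas method), which stops at the first match instead of scanning the whole column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'n': np.random.randint(1, 10, size=10_000)})

def first_index_where(series, predicate):
    # Walk the raw ndarray and return the index label of the
    # first element satisfying the predicate, or None.
    for i, value in enumerate(series.to_numpy()):
        if predicate(value):
            return series.index[i]
    return None

idx = first_index_where(df.n, lambda v: v % 2 == 0)
```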
I decided for fun to play with a few possibilities. I take a dataframe:
MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})
(not random this time.) I want to find the first row for which n >= N, for some value of N. I have timed the following four versions:
def getfirst_pandas(condition, df):
    return df[condition(df)].iloc[0]

def getfirst_iterrows_loop(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None

def getfirst_for_loop(condition, df):
    for j in range(len(df)):
        if condition(df.iloc[j]):
            break
    return j

def getfirst_numpy_argmax(condition, df):
    array = df.to_numpy()  # as_matrix() was removed in pandas 1.0
    imax = np.argmax(condition(array))
    return df.index[imax]
with N set to powers of ten. Of course the numpy (optimized C) code is expected to be faster than for loops in Python, but I wanted to see for which values of N Python loops are still okay.
I timed the lines:
getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)
for N = 1, 10, 100, 1000, .... This is the log-log graph of the performance:
[Figure: log-log plot of runtime versus N for the four methods]
The simple for loop is fine as long as the "first True position" is expected to be near the beginning, but it degrades badly after that. The np.argmax approach is the safest solution.
As you can see from the graph, the times for pandas and argmax remain (almost) constant, because they always scan the whole array. It would be perfect to have a numpy or pandas method which doesn't.
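One way to approximate such a method (a sketch under the assumption that the condition can be evaluated vectorized per slice; this is not an existing numpy or pandas API) is to evaluate the condition in fixed-size chunks and stop at the first chunk containing a True:

```python
import numpy as np

def first_true_index(condition, array, chunk=4096):
    # Evaluate the condition block by block and stop at the first
    # block containing a True, instead of scanning the whole array.
    for start in range(0, len(array), chunk):
        mask = condition(array[start:start + chunk])
        if mask.any():
            return start + int(np.argmax(mask))
    return None

arr = np.arange(10**6)
idx = first_true_index(lambda a: a >= 1234, arr)  # only the first chunk is scanned
```

This keeps the vectorized inner work of argmax while regaining the early exit of the Python loop; the chunk size trades per-chunk overhead against wasted work past the first match.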
I did some timings, and yes, using a generator will normally give you quicker results:
df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})
%timeit df[df.n % 2 == 0].iloc[0]
%timeit df.iloc[next(k for k,v in df.iterrows() if v.n % 2 == 0)]
%timeit df.iloc[next(t[0] for t in df.itertuples() if t.n % 2 == 0)]
I get:
1000 loops, best of 3: 1.09 ms per loop
1000 loops, best of 3: 619 µs per loop # <-- iterrows generator
1000 loops, best of 3: 1.1 ms per loop
10000 loops, best of 3: 25 µs per loop # <--- your solution
However, when you scale it up:
df = pd.DataFrame({'n': np.random.randint(1, 10, size = 1000000)})
the difference disappears:
10 loops, best of 3: 40.5 ms per loop
10 loops, best of 3: 40.7 ms per loop # <--- iterrows
10 loops, best of 3: 56.9 ms per loop
Your solution is quickest, so why not use it?
for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
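If you do keep the explicit loop, a minor variation of the same idea (iterating over the raw NumPy values instead of calling df.n.iloc[j] on each step) removes the per-element pandas overhead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'n': np.random.randint(1, 10, size=10_000)})

# Plain ndarray: element access is much cheaper than .iloc.
values = df.n.to_numpy()
for j, v in enumerate(values):
    if v % 2 == 0:
        break
```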