
Find (only) the first row satisfying a given condition in pandas DataFrame

Tags: python, pandas

I have a dataframe df with a very long column of random positive integers:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

I want to determine the index of the first even number in the column. One way to do this is:

df[df.n % 2 == 0].iloc[0]

but this involves a lot of operations (generating the boolean mask df.n % 2 == 0, evaluating df on that mask, and finally taking the first item) and is very slow. A loop like this is much quicker:

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break

especially because the first match will probably be within the first few rows. Is there a pandas method for doing this with similar performance? Thank you.

NOTE: This condition (to be an even number) is just an example. I'm looking for a solution that works for any kind of condition on the values, i.e., for a fast one-line alternative to:

df[ conditions on df.n ].iloc[0]
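For reference, the loop above can be wrapped into a helper that takes an arbitrary predicate. This is just a sketch of the generalization being asked about (the helper name and signature are my own, not a pandas API):

```python
import numpy as np
import pandas as pd

def first_index(df, predicate):
    """Return the index label of the first row of df whose 'n' value
    satisfies predicate, or None if no row matches.
    Short-circuits like the explicit loop above."""
    for idx, value in zip(df.index, df['n'].to_numpy()):
        if predicate(value):
            return idx
    return None

df = pd.DataFrame({'n': np.random.randint(1, 10, size=10000)})
first_even = first_index(df, lambda v: v % 2 == 0)
```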
asked Dec 01 '17 by peter
2 Answers

I decided for fun to play with a few possibilities. I use a dataframe:

MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})

(not random this time.) I want to find the first row for which n >= N for some value of N. I have timed the following four versions:

def getfirst_pandas(condition, df):
    return df[condition(df)].iloc[0]

def getfirst_iterrows_loop(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None

def getfirst_for_loop(condition, df):
    for j in range(len(df)):
        if condition(df.iloc[j]):
            break
    return j

def getfirst_numpy_argmax(condition, df):
    array = df.to_numpy()  # as_matrix() was removed in pandas 1.0
    imax  = np.argmax(condition(array))
    return df.index[imax]
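One caveat with the argmax version (my observation, not part of the original answer): np.argmax over a boolean array returns 0 when no element is True, which is indistinguishable from a genuine match at position 0. A hedged variant with a guard:

```python
import numpy as np
import pandas as pd

def getfirst_numpy_argmax_safe(condition, series):
    """Like getfirst_numpy_argmax, but returns None when nothing matches."""
    mask = condition(series.to_numpy())
    if not mask.any():          # argmax alone would silently return 0 here
        return None
    return series.index[np.argmax(mask)]

s = pd.Series(range(100))
getfirst_numpy_argmax_safe(lambda a: a >= 50, s)   # 50
```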

with N = powers of ten. Of course the numpy (optimized C) code is expected to be faster than for loops in python, but I wanted to see for which values of N python loops are still okay.

I timed the lines:

getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)

for N = 1, 10, 100, 1000, .... This is the log-log graph of the performance:

[log-log plot: runtime of the four methods vs N]

The simple for loop is fine as long as the "first True position" is near the beginning, but it degrades badly as that position moves toward the end. np.argmax is the safest solution.

As you can see from the graph, the times for pandas and argmax remain (almost) constant, because they always scan the whole array. It would be perfect to have a numpy or pandas method which doesn't.
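One way to approximate such a method (a sketch of my own, not an existing numpy or pandas API) is to scan the underlying array in chunks: each chunk is checked with a vectorized operation, but the scan stops at the first chunk containing a match, so the whole array is only traversed in the worst case:

```python
import numpy as np
import pandas as pd

def first_match_chunked(series, condition, chunk=4096):
    """Vectorized per chunk; short-circuits at the first chunk with a hit.
    Returns the index label of the first match, or None."""
    arr = series.to_numpy()
    for start in range(0, len(arr), chunk):
        mask = condition(arr[start:start + chunk])
        if mask.any():
            return series.index[start + np.argmax(mask)]
    return None
```

The chunk size trades per-chunk overhead against wasted work past the first match; 4096 is an arbitrary choice, not a tuned value.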

answered Sep 18 '22 by peter

I did some timings, and yes, using a generator will normally give you quicker results:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

%timeit df[df.n % 2 == 0].iloc[0]
%timeit df.iloc[next(k for k,v in df.iterrows() if v.n % 2 == 0)]
%timeit df.iloc[next(t[0] for t in df.itertuples() if t.n % 2 == 0)]

I get:

1000 loops, best of 3: 1.09 ms per loop
1000 loops, best of 3: 619 µs per loop # <-- iterrows generator
1000 loops, best of 3: 1.1 ms per loop
10000 loops, best of 3: 25 µs per loop # <--- your solution (the explicit loop)

However when you size it up:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 1000000)})

The differences disappear:

10 loops, best of 3: 40.5 ms per loop 
10 loops, best of 3: 40.7 ms per loop # <--- iterrows
10 loops, best of 3: 56.9 ms per loop

Your solution is quickest, so why not use it?

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
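One caveat (my note, not the answer's): if no row matches, the loop above ends with j == len(df) - 1, which looks just like a match at the last row. A for/else variant distinguishes the two cases:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'n': np.random.randint(1, 10, size=10000)})

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
else:          # runs only if the loop never hit `break`
    j = None   # no even number in the column
```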
answered Sep 18 '22 by Anton vBR