Faster implementation of pandas apply function

I have a pandas DataFrame in which I would like to check, row by row, whether the value in one column is contained in the value of another.

Suppose:

import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})

I would like to check whether df.B[0] is in df.A[0], df.B[1] is in df.A[1], and so on.

Current approach

I currently use the following apply implementation:

df.apply(lambda x: x['B'] in x['A'], axis=1)

The result is a Series of [True, False, True],

which is correct, but my DataFrame has millions of rows and this takes quite a long time.
Is there a better (i.e. faster) implementation?

Unsuccessful approach

I tried pandas.Series.str.contains, but it only accepts a single string as the pattern, not a Series:

df['A'].str.contains(df['B'], regex=False)
asked Dec 25 '17 17:12 by dimitris_ps


1 Answer

Use np.vectorize. It still calls the lambda once per row in Python, but it bypasses the overhead of apply, which builds a Series for every row, so it should be faster.

import numpy as np

v = np.vectorize(lambda x, y: y in x)

v(df.A, df.B)
array([ True, False,  True], dtype=bool)
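
The same idea can be wrapped in a small helper. This is only a sketch: the function name `contains_rowwise` is hypothetical, and passing `otypes=[bool]` is an addition of mine. Without it, `np.vectorize` works out the output type by calling the function on the first element, which fails on empty input.

```python
import numpy as np
import pandas as pd


def contains_rowwise(haystack: pd.Series, needle: pd.Series) -> pd.Series:
    """Return a boolean Series: True where needle[i] is a substring of haystack[i]."""
    # otypes=[bool] fixes the output dtype, so empty input works as well.
    v = np.vectorize(lambda a, b: b in a, otypes=[bool])
    # Plain arrays make the pairing positional; the original index is restored afterwards.
    values = v(haystack.to_numpy(), needle.to_numpy())
    return pd.Series(values, index=haystack.index)


df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})
result = contains_rowwise(df['A'], df['B'])
print(result.tolist())
```

The helper assumes both columns hold strings with no missing values; a NaN in either column would make the `in` test raise a TypeError.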

Here's a timing comparison:

df = pd.concat([df] * 10000)

%timeit df.apply(lambda x: x[1] in x[0], axis=1)
1 loop, best of 3: 1.32 s per loop

%timeit v(df.A, df.B)
100 loops, best of 3: 5.55 ms per loop

# Psidom's answer
%timeit [b in a for a, b in zip(df.A, df.B)]
100 loops, best of 3: 3.34 ms per loop

Both np.vectorize and the list comprehension are competitive options, each a few hundred times faster than apply on this data.
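
The list comprehension returns a plain Python list. A minimal sketch of storing it back on the frame and filtering with it follows; the column name `B_in_A` and the variable names are hypothetical, and both columns are assumed to contain only strings.

```python
import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})
# Repeat the rows as in the timings; the index now contains duplicate labels.
big = pd.concat([df] * 3)

# zip pairs the values by position, so duplicate index labels do not matter.
big['B_in_A'] = [b in a for a, b in zip(big['A'], big['B'])]

# Keep only the rows where B is a substring of A.
matches = big[big['B_in_A']]
print(len(matches))
```

Assigning the list directly avoids any index alignment, which is convenient here because `pd.concat` without `ignore_index=True` leaves repeated index labels.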

Edit: adding timings for Wen's and MaxU's answers:

# Wen's answer
%timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
10 loops, best of 3: 49.1 ms per loop

# MaxU's answer
%timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(1)
10 loops, best of 3: 87.8 ms per loop
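
The `%timeit` lines above need IPython. A rough sketch of the same comparison with the standard-library `timeit` module is below. The `candidates` dict, the repeat counts, the label-based `row['B'] in row['A']` form, and `ignore_index=True` are my choices rather than part of the original benchmark, and absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})
df = pd.concat([df] * 10000, ignore_index=True)

v = np.vectorize(lambda x, y: y in x)

# Each candidate is a zero-argument callable so timeit can invoke it directly.
candidates = {
    'apply': lambda: df.apply(lambda row: row['B'] in row['A'], axis=1),
    'np.vectorize': lambda: v(df['A'], df['B']),
    'list comprehension': lambda: [b in a for a, b in zip(df['A'], df['B'])],
}

# Check that all approaches agree before comparing their timings.
results = {name: list(fn()) for name, fn in candidates.items()}
expected = results['list comprehension']
assert all(r == expected for r in results.values())

timings = {}
for name, fn in candidates.items():
    # Best of 3 single-call runs, converted from seconds to milliseconds.
    timings[name] = min(timeit.repeat(fn, number=1, repeat=3)) * 1000
    print(f'{name}: {timings[name]:.2f} ms')
```

Checking that the outputs match first matters because a faster variant is only useful if it answers the same question.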
answered Sep 18 '22 01:09 by cs95