I have a pandas DataFrame in which I would like to check whether one column is contained in another.
df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})
I would like to check whether df.B[0] is in df.A[0], df.B[1] is in df.A[1], and so on.
I have the following apply-based implementation:
df.apply(lambda x: x[1] in x[0], axis=1)
The result is a Series of [True, False, True], which is correct, but given my DataFrame's shape (millions of rows) it takes quite long. Is there a better (i.e. faster) implementation?
I tried the pandas.Series.str.contains approach, but it can only take a string as the pattern:
df['A'].str.contains(df['B'], regex=False)
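Since str.contains expects a scalar pattern rather than a Series, a plain element-wise check over the zipped columns works instead. A minimal runnable sketch (wrapping the result into an index-aligned Series is my addition, not part of the original snippet):

```python
import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})

# Pair each row's B with its A via zip, then wrap the booleans
# back into a Series aligned with the original index.
mask = pd.Series([b in a for a, b in zip(df['A'], df['B'])], index=df.index)
print(mask.tolist())  # [True, False, True]
```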
Use np.vectorize: it bypasses the apply overhead, so it should be a bit faster.
v = np.vectorize(lambda x, y: y in x)
v(df.A, df.B)
array([ True, False, True], dtype=bool)
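An end-to-end version of the same approach, for completeness (converting the resulting array into an index-aligned Series is my addition; note that np.vectorize is still a Python-level loop under the hood, just with less per-row overhead than apply):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['some text here', 'another text', 'and this'],
                   'B': ['some', 'somethin', 'this']})

# Broadcast the substring test over both columns at once.
v = np.vectorize(lambda a, b: b in a)
result = pd.Series(v(df['A'], df['B']), index=df.index)
print(result.tolist())  # [True, False, True]
```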
Here's a timings comparison -
df = pd.concat([df] * 10000)
%timeit df.apply(lambda x: x[1] in x[0], axis=1)
1 loop, best of 3: 1.32 s per loop
%timeit v(df.A, df.B)
100 loops, best of 3: 5.55 ms per loop
# Psidom's answer
%timeit [b in a for a, b in zip(df.A, df.B)]
100 loops, best of 3: 3.34 ms per loop
Both are pretty competitive options!
Edit: adding timings for Wen's and MaxU's answers -
# Wen's answer
%timeit df.A.replace(dict(zip(df.B.tolist(),[np.nan]*len(df))),regex=True).isnull()
10 loops, best of 3: 49.1 ms per loop
# MaxU's answer
%timeit df['A'].str.split(expand=True).eq(df['B'], axis=0).any(1)
10 loops, best of 3: 87.8 ms per loop