I have a dataframe which you can build with this:
dflist=[['123',['abc','qw3','123']],
['ab12',['3e4r5','12we3','asd23','q2w3']]]
df=pd.DataFrame(dflist,columns=['check','checklist'])
And looks like this:
check checklist
0 123 [abc, qw3, 123]
1 ab12 [3e4r5, 12we3, asd23, q2w3]
I want to check if the item in column "check" is in the list in column "checklist". So I want the resulting dataframe to look like:
check checklist checkisin
0 123 [abc, qw3, 123] True
1 ab12 [3e4r5, 12we3, asd23, q2w3] False
I have tried several things including using .isin in various forms including apply/lambda. and directly.
This:
df['checkisin']=df.check.isin(df.checklist)
produces:
check checklist checkisin
0 123 [abc, qw3, 123] False
1 ab12 [3e4r5, 12we3, asd23, q2w3] False
which has two Falses.
Trying this: df['checkisin']=df.apply(lambda x:x.check.isin(x.checklist)) gives this error:
AttributeError: ("'Series' object has no attribute 'check'", 'occurred at index check')
Trying this:
df['checkisin']=df.apply(lambda x:x['check'] in x.checklist)
gives this error:
KeyError: ('check', 'occurred at index check')
I'm sure I'm missing something simple here. I know I could loop this, but looking for a Pandas Dataframe column wise solution as the DF I have is very large and trying to "most" efficiently handle.
Thanks!
You have a column of lists, of course pandas doesn't have any functions that natively support operations on data structured this badly. If you want the most performance, I'd recommend a list comprehension:
df['checkisin'] = [c in l for c, l in zip(df['check'], df['checklist'])]
df
check checklist checkisin
0 123 [abc, qw3, 123] True
1 ab12 [3e4r5, 12we3, asd23, q2w3] False
You might consider implementing try-except error handling if you're worried about NaNs and type mismatches:
def check_isin(check, checklist):
try:
return check in checklist
except TypeError:
return np.NaN
df['checkisin'] = [
check_isin(c, l) for c, l in zip(df['check'], df['checklist'])
]
Evidence suggests list comprehensions are the most desirable option for operations that cannot be vectorized.
PS, consider converting your column of lists into a column of sets if you plan to do membership tests a lot.
Here's an example of how this operation can be vectorized.
from itertools import chain
cl = df.pop('checklist')
df = (pd.DataFrame(df.reset_index().values.repeat(cl.str.len(), axis=0),
columns=['group', *df.columns])
.assign(checklist=list(chain.from_iterable(cl))))
df
group check checklist
0 0 123 abc
1 0 123 qw3
2 0 123 123
3 1 ab12 3e4r5
4 1 ab12 12we3
5 1 ab12 asd23
6 1 ab12 q2w3
7 1 ab12 123
(df['check'] == df['checklist']).groupby(df.group).any()
group
0 True
1 False
dtype: bool
I will using isin
with Series
, isin with Series will match the index
first which is different from list
pd.DataFrame(df.checklist.tolist(),index=df.index).isin(df.check).any(1)
Out[496]:
0 True
1 False
dtype: bool
Or
pd.DataFrame(df.checklist.tolist(),index=df.index).eq(df.check,0).any(1)
map
df.assign(checkisin=[*map(lambda s, x: s in x, *map(df.get, df))])
check checklist checkisin
0 123 [abc, qw3, 123] True
1 ab12 [3e4r5, 12we3, asd23, q2w3] False
If your dataframe has more columns, you can be more explicit
cols = ['check', 'checklist']
df.assign(checkisin=[*map(lambda s, x: s in x, *map(df.get, cols))])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With