Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas - Check if column of strings in column of lists row wise

I have a dataframe which you can build with this:

dflist=[['123',['abc','qw3','123']],
        ['ab12',['3e4r5','12we3','asd23','q2w3']]]
df=pd.DataFrame(dflist,columns=['check','checklist'])

And looks like this:

  check                    checklist
0   123              [abc, qw3, 123]
1  ab12  [3e4r5, 12we3, asd23, q2w3]

I want to check if the item in column "check" is in the list in column "checklist". So I want the resulting dataframe to look like:

  check                    checklist checkisin
0   123              [abc, qw3, 123]      True
1  ab12  [3e4r5, 12we3, asd23, q2w3]     False

I have tried several things including using .isin in various forms including apply/lambda. and directly.

This:

df['checkisin']=df.check.isin(df.checklist)

produces:

  check                    checklist  checkisin
0   123              [abc, qw3, 123]      False
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

which has two Falses.

Trying this: df['checkisin']=df.apply(lambda x:x.check.isin(x.checklist)) gives this error:

AttributeError: ("'Series' object has no attribute 'check'", 'occurred at index check')

Trying this:

df['checkisin']=df.apply(lambda x:x['check'] in x.checklist)

gives this error:

KeyError: ('check', 'occurred at index check')

I'm sure I'm missing something simple here. I know I could loop this, but looking for a Pandas Dataframe column wise solution as the DF I have is very large and trying to "most" efficiently handle.

Thanks!

like image 491
clg4 Avatar asked Jun 30 '19 01:06

clg4


3 Answers

You have a column of lists, of course pandas doesn't have any functions that natively support operations on data structured this badly. If you want the most performance, I'd recommend a list comprehension:

df['checkisin'] = [c in l for c, l in zip(df['check'], df['checklist'])]
df
  check                    checklist  checkisin
0   123              [abc, qw3, 123]       True
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

You might consider implementing try-except error handling if you're worried about NaNs and type mismatches:

def check_isin(check, checklist):
    try:
        return check in checklist
    except TypeError:
        return np.NaN

df['checkisin'] = [
    check_isin(c, l) for c, l in zip(df['check'], df['checklist'])
]

Evidence suggests list comprehensions are the most desirable option for operations that cannot be vectorized.

PS, consider converting your column of lists into a column of sets if you plan to do membership tests a lot.


Here's an example of how this operation can be vectorized.

from itertools import chain

cl = df.pop('checklist')
df = (pd.DataFrame(df.reset_index().values.repeat(cl.str.len(), axis=0), 
                   columns=['group', *df.columns])
        .assign(checklist=list(chain.from_iterable(cl))))

df

   group check checklist
0      0   123       abc
1      0   123       qw3
2      0   123       123
3      1  ab12     3e4r5
4      1  ab12     12we3
5      1  ab12     asd23
6      1  ab12      q2w3
7      1  ab12       123

(df['check'] == df['checklist']).groupby(df.group).any()

group
0     True
1    False
dtype: bool
like image 77
cs95 Avatar answered Nov 07 '22 23:11

cs95


I will using isin with Series, isin with Series will match the index first which is different from list

pd.DataFrame(df.checklist.tolist(),index=df.index).isin(df.check).any(1)
Out[496]: 
0     True
1    False
dtype: bool

Or

pd.DataFrame(df.checklist.tolist(),index=df.index).eq(df.check,0).any(1)
like image 36
BENY Avatar answered Nov 07 '22 23:11

BENY


map

df.assign(checkisin=[*map(lambda s, x: s in x, *map(df.get, df))])

  check                    checklist  checkisin
0   123              [abc, qw3, 123]       True
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

If your dataframe has more columns, you can be more explicit

cols = ['check', 'checklist']
df.assign(checkisin=[*map(lambda s, x: s in x, *map(df.get, cols))])
like image 24
piRSquared Avatar answered Nov 07 '22 23:11

piRSquared