Say I have the following dataframe df
:
A B C
0 mom;dad;son; sister;son; yes;no;maybe;
1 dad; daughter;niece; no;snow;
2 son;dad; cat;son;dad; tree;dad;son;
3 daughter;mom; niece; referee;
4 dad;daughter; cat; dad;
And you want to check if, between columns A
, B
, and C
, there is a common word, and create a column D
with 1
if there is and 0
if there isn't any. For a word to be common, it's enough for it to appear in just two of the three columns.
The outcome should be:
A B C D
0 mom;dad;son; sister;son; yes;no;maybe; 1
1 dad; daughter;niece; no;snow; 0
2 son;dad; cat;son;dad; tree;dad;son; 1
3 daughter;mom; niece; referee; 0
4 dad;daughter; cat; dad; 1
I am trying to implement this by doing:
for index, row in df.iterrows():
w1=row['A'].split(';')
w2=row['B'].split(';')
w3=row['C'].split(';')
if len(set(w1).intersection(w2))>0 or len(set(w1).intersection(w3))>0 or len(set(w2).intersection(w3))>0:
df['D'][index]==1
else:
df['D'][index]==0
However, the resulting D
column only bears 0
because (possibly) I am not comparing each individual word in w1 to the others in w2 and w3. How could I achieve this?
Use stack
+ pandas.Series.str.get_dummies
df.assign(
D=df.stack().str.get_dummies(';').sum(level=0).gt(1).any(1).astype(int)
)
A B C D
0 mom;dad;son; sister;son; yes;no;maybe; 1
1 dad; daughter;niece; no;snow; 0
2 son;dad; cat;son;dad; tree;dad;son; 1
3 daughter;mom; niece; referee; 0
4 dad;daughter; cat; dad; 1
Notice that when we stack and get dummies, the interim result looks like this:
cat dad daughter maybe mom niece no referee sister snow son tree yes
0 A 0 1 0 0 1 0 0 0 0 0 1 0 0
B 0 0 0 0 0 0 0 0 1 0 1 0 0
C 0 0 0 1 0 0 1 0 0 0 0 0 1
1 A 0 1 0 0 0 0 0 0 0 0 0 0 0
B 0 0 1 0 0 1 0 0 0 0 0 0 0
C 0 0 0 0 0 0 1 0 0 1 0 0 0
2 A 0 1 0 0 0 0 0 0 0 0 1 0 0
B 1 1 0 0 0 0 0 0 0 0 1 0 0
C 0 1 0 0 0 0 0 0 0 0 1 1 0
3 A 0 0 1 0 1 0 0 0 0 0 0 0 0
B 0 0 0 0 0 1 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 1 0 0 0 0 0
4 A 0 1 1 0 0 0 0 0 0 0 0 0 0
B 1 0 0 0 0 0 0 0 0 0 0 0 0
C 0 1 0 0 0 0 0 0 0 0 0 0 0
Where the prior columns are embedded in the second level of the index. So I want to sum over the first level in order to see how many times that word appears.
That summation interim looks like:
cat dad daughter maybe mom niece no referee sister snow son tree yes
0 0 1 0 1 1 0 1 0 1 0 2 0 1
1 0 1 1 0 0 1 1 0 0 1 0 0 0
2 1 3 0 0 0 0 0 0 0 0 3 1 0
3 0 0 1 0 1 1 0 1 0 0 0 0 0
4 1 2 1 0 0 0 0 0 0 0 0 0 0
Notice that we catch 'son'
in row 1, 'dad'
and 'son'
in row 3 and so on.
If it appears in more than 1 column (hence gt(1)
) then I want to count it as a 1
(hence any(1).astype(int)
).
This one-liner creates what you need, using collections.Counter:
from collections import Counter
df['D'] = df.applymap(lambda x: [i for i in x.split(';') if i]).apply(lambda x: int(Counter(x.A+x.B+x.C).most_common(1)[0][1]!=1), axis=1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With