DataFrame
import pandas as pd
df = pd.DataFrame({'A': [['gener'], ['gener'], ['system'], ['system'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum', 'toledo']], 'B': [['gutter'], ['gutter'], ['gutter', 'system'], ['gutter', 'guard', 'system'], ['ohio', 'gutter'], ['gutter', 'toledo'], ['toledo', 'gutter'], ['gutter'], ['gutter'], ['gutter'], ['how', 'to', 'instal', 'aluminum', 'gutter'], ['aluminum', 'gutter'], ['aluminum', 'gutter', 'color'], ['aluminum', 'gutter'], ['aluminum', 'gutter', 'adrian', 'ohio'], ['aluminum', 'gutter', 'bowl', 'green', 'ohio'], ['aluminum', 'gutter', 'maume', 'ohio'], ['aluminum', 'gutter', 'perrysburg', 'ohio'], ['aluminum', 'gutter', 'tecumseh', 'ohio'], ['aluminum', 'gutter', 'toledo', 'ohio']]}, columns=['A', 'B'])
What it Looks Like
I have a dataframe with two columns of lists.
A B
0 [gener] [gutter]
1 [gener] [gutter]
2 [system] [gutter, system]
3 [system] [gutter, guard, system]
4 [gutter] [ohio, gutter]
5 [gutter] [gutter, toledo]
6 [gutter] [toledo, gutter]
7 [gutter] [gutter]
8 [gutter] [gutter]
9 [gutter] [gutter]
10 [aluminum] [how, to, instal, aluminum, gutter]
11 [aluminum] [aluminum, gutter]
12 [aluminum] [aluminum, gutter, color]
13 [aluminum] [aluminum, gutter]
14 [aluminum] [aluminum, gutter, adrian, ohio]
15 [aluminum] [aluminum, gutter, bowl, green, ohio]
16 [aluminum] [aluminum, gutter, maume, ohio]
17 [aluminum] [aluminum, gutter, perrysburg, ohio]
18 [aluminum] [aluminum, gutter, tecumseh, ohio]
19 [aluminum, toledo] [aluminum, gutter, toledo, ohio]
Question
If I have columns of lists, is there a pandas function that lets me operate on the entire array of lists to check for intersection and return either a boolean or the intersecting values as a new series?
For example, I'd like pandas to have an equivalent of this:
def intersection(df, col1, col2, return_type='boolean'):
    if return_type == 'boolean':
        s = []
        for _, row in df.iterrows():
            # True if any item in col2's list also appears in col1's list
            s.append(any(phrase in row[col1] for phrase in row[col2]))
        return pd.Series(s, index=df.index)
    elif return_type == 'word':
        s = []
        for _, row in df.iterrows():
            # Comma-separated items common to both lists
            s.append(', '.join(set(row[col1]).intersection(row[col2])))
        return pd.Series(s, index=df.index)

# Create column C in df
df['C'] = intersection(df, 'A', 'B', 'word')
... without having to write my own function or resort to for loops. I feel like there must be an easier way to compare lists in two columns on the same row to see if they intersect.
I can do it with for loops, but it's ugly to me.
A for loop to return a boolean series:
for _, row in df.iterrows():
    any(phrase in row['A'] for phrase in row['B'])
Produces:
False
False
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
Or, finding the intersecting words using sets:
for _, row in df.iterrows():
    ', '.join(set(row['A']).intersection(row['B']))
''
''
'system'
'system'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'toledo, aluminum'
To check if every item in df.A is contained in df.B:
>>> df.apply(lambda row: all(i in row.B for i in row.A), axis=1)
# OR: ~(df['A'].apply(set) - df['B'].apply(set)).astype(bool)
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
dtype: bool
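Note that the check above tests whether every item of A is in B (a subset test), whereas the loop in the question tests whether the two lists share at least one item. The two happen to agree on the question's data but can differ in general. A minimal sketch of both, assuming the same two-column layout; the three-row frame is a made-up sample, not the original data:

```python
import pandas as pd

# Small made-up sample with the same shape as the question's data
df = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter', 'system'], ['aluminum', 'gutter']],
})

# Subset test, as in the answer: every item of A appears in B
subset = df.apply(lambda row: all(i in row.B for i in row.A), axis=1)

# Overlap test, as in the question: the two lists share at least one item
overlaps = pd.Series(
    [not set(a).isdisjoint(b) for a, b in zip(df['A'], df['B'])],
    index=df.index,
)

print(subset.tolist())    # [False, True, False]
print(overlaps.tolist())  # [False, True, True]
```

The last row shows the difference: 'toledo' is missing from B, so the subset test fails, but 'aluminum' is shared, so the overlap test passes.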
To get the intersection:
df['intersection'] = [list(set(a).intersection(set(b)))
                      for a, b in zip(df.A, df.B)]
>>> df
A B intersection
0 [gener] [gutter] []
1 [gener] [gutter] []
2 [system] [gutter, system] [system]
3 [system] [gutter, guard, system] [system]
4 [gutter] [ohio, gutter] [gutter]
5 [gutter] [gutter, toledo] [gutter]
6 [gutter] [toledo, gutter] [gutter]
7 [gutter] [gutter] [gutter]
8 [gutter] [gutter] [gutter]
9 [gutter] [gutter] [gutter]
10 [aluminum] [how, to, instal, aluminum, gutter] [aluminum]
11 [aluminum] [aluminum, gutter] [aluminum]
12 [aluminum] [aluminum, gutter, color] [aluminum]
13 [aluminum] [aluminum, gutter] [aluminum]
14 [aluminum] [aluminum, gutter, adrian, ohio] [aluminum]
15 [aluminum] [aluminum, gutter, bowl, green, ohio] [aluminum]
16 [aluminum] [aluminum, gutter, maume, ohio] [aluminum]
17 [aluminum] [aluminum, gutter, perrysburg, ohio] [aluminum]
18 [aluminum] [aluminum, gutter, tecumseh, ohio] [aluminum]
19 [aluminum, toledo] [aluminum, gutter, toledo, ohio] [aluminum, toledo]
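If you want the comma-separated string from the question rather than a list, one option is to join the intersection directly. This is only a sketch: the sample frame is made up, and sorting is an assumption added here so the output is deterministic, since set iteration order is not guaranteed.

```python
import pandas as pd

# Made-up sample rows in the same shape as the question's data
df = pd.DataFrame({
    'A': [['gener'], ['gutter'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['ohio', 'gutter'],
          ['aluminum', 'gutter', 'toledo', 'ohio']],
})

# Join the sorted common items of each row into one string
df['C'] = [', '.join(sorted(set(a) & set(b)))
           for a, b in zip(df['A'], df['B'])]

print(df['C'].tolist())  # ['', 'gutter', 'aluminum, toledo']
```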
Just use the apply function supported by pandas; it's great.
Since you may have more than two columns to intersect, you can write an auxiliary function like the one below and apply it with the DataFrame.apply function (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html). Note that axis=1 applies the function to each row, while axis=0 applies it to each column. With axis=1, each row is passed to the function as an iterable Series object whose values are that row's entries across the columns.
def intersect(ss):
    ss = iter(ss)
    s = set(next(ss))
    for t in ss:
        s.intersection_update(t)  # `t` need not be a `set` here; a `list` or any iterable is OK
    return s

res = df.apply(intersect, axis=1)
>>> res
0 {}
1 {}
2 {system}
3 {system}
4 {gutter}
5 {gutter}
6 {gutter}
7 {gutter}
8 {gutter}
9 {gutter}
10 {aluminum}
11 {aluminum}
12 {aluminum}
13 {aluminum}
14 {aluminum}
15 {aluminum}
16 {aluminum}
17 {aluminum}
18 {aluminum}
19 {aluminum, toledo}
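As a sketch of the multi-column case, the same helper can be applied to a frame with a third list column. The frame `df3`, its column `C`, and all values below are hypothetical, added only for illustration.

```python
import pandas as pd

def intersect(ss):
    # Start from the first cell's items, then narrow with each later cell
    ss = iter(ss)
    s = set(next(ss))
    for t in ss:
        s.intersection_update(t)
    return s

# Hypothetical three-column frame of lists
df3 = pd.DataFrame({
    'A': [['gutter'], ['aluminum', 'toledo']],
    'B': [['ohio', 'gutter'], ['aluminum', 'gutter', 'toledo']],
    'C': [['gutter', 'guard'], ['toledo', 'ohio']],
})

# axis=1 passes each row to intersect as a Series of three lists
res3 = df3.apply(intersect, axis=1)
print(res3.tolist())  # [{'gutter'}, {'toledo'}]
```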
You can chain further operations on the result of the auxiliary function, or write similar variations of it.
Hope this helps.