I want to compare two columns: Description and Employer. I want to see if any keywords from Employer are found in the Description column. I have broken the Employer column down into words and converted them to a list. Now I want to see if any of those words are in the corresponding Description column.
Sample input:
print(df.head(25))
Date Description Amount AutoNumber \
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246
Employer
0 Cansel Survey Equipment
2 Cansel Survey Equipment
5 Cansel Survey Equipment
9 Cansel Survey Equipment
10 Cansel Survey Equipment
11 Cansel Survey Equipment
12 Cansel Survey Equipment
13 Cansel Survey Equipment
14 Cansel Survey Equipment
15 Cansel Survey Equipment
17 Cansel Survey Equipment
19 Cansel Survey Equipment
20 Cansel Survey Equipment
21 Cansel Survey Equipment
22 Cansel Survey Equipment
23 Cansel Survey Equipment
24 Cansel Survey Equipment
26 Cansel Survey Equipment
27 Cansel Survey Equipment
28 Cansel Survey Equipment
I tried something like this, but it doesn't seem to work:
df['Text_Search'] = df['Employer'].apply(lambda x: x.split(" "))
df['Match'] = np.where(df['Description'].str.contains("|".join(df['Text_Search'])), "Yes", "No")
My desired output would be as shown below:
Date Description Amount AutoNumber \
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246
29 12/30/2014 WZ553 TFR?FR xxx8690 200.00 49246
32 12/29/2014 JW173 TFR?FR xxx8690 300.00 49246
33 12/24/2014 CANSEL SURVEY E PAY 1219.21 49246
34 12/24/2014 CANSEL SURVEY E PAY 434.84 49246
36 12/23/2014 WT002 TFR?FR xxx8690 160.00 49246
Employer Text_Search Match
0 Cansel Survey Equipment [Cansel, Survey, Equipment] No
2 Cansel Survey Equipment [Cansel, Survey, Equipment] No
5 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
9 Cansel Survey Equipment [Cansel, Survey, Equipment] No
10 Cansel Survey Equipment [Cansel, Survey, Equipment] No
11 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
12 Cansel Survey Equipment [Cansel, Survey, Equipment] No
13 Cansel Survey Equipment [Cansel, Survey, Equipment] No
14 Cansel Survey Equipment [Cansel, Survey, Equipment] No
15 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
17 Cansel Survey Equipment [Cansel, Survey, Equipment] No
19 Cansel Survey Equipment [Cansel, Survey, Equipment] No
20 Cansel Survey Equipment [Cansel, Survey, Equipment] No
21 Cansel Survey Equipment [Cansel, Survey, Equipment] No
22 Cansel Survey Equipment [Cansel, Survey, Equipment] No
23 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
24 Cansel Survey Equipment [Cansel, Survey, Equipment] No
26 Cansel Survey Equipment [Cansel, Survey, Equipment] No
27 Cansel Survey Equipment [Cansel, Survey, Equipment] No
28 Cansel Survey Equipment [Cansel, Survey, Equipment] No
29 Cansel Survey Equipment [Cansel, Survey, Equipment] No
32 Cansel Survey Equipment [Cansel, Survey, Equipment] No
33 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
34 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
36 Cansel Survey Equipment [Cansel, Survey, Equipment] No
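For reference, the attempt above fails because df['Text_Search'] holds lists, so "|".join(df['Text_Search']) raises a TypeError (join expects strings); even with strings, one joined pattern would mix every row's keywords instead of matching row by row, and str.contains is case-sensitive by default, so "Cansel" would not match "CANSEL". A minimal per-row sketch that reproduces the Match column above, assuming the intended rule is case-insensitive substring matching of each Employer word against that row's Description, could look like this:

import numpy as np

df['Text_Search'] = df['Employer'].apply(lambda x: x.split())
df['Match'] = np.where(
    [any(word.lower() in desc.lower() for word in words)
     for desc, words in zip(df['Description'], df['Text_Search'])],
    'Yes', 'No')

The list comprehension walks the two columns in parallel, so each Description is checked only against its own row's Employer words.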
Let's prepare a bit bigger DF of 2,000 rows:
In [3]: df = pd.concat([df] * 10**2, ignore_index=True)
In [4]: df.shape
Out[4]: (2000, 5)
Solution 1: df.apply(..., axis=1)
df["Text_Search"] = df.Employer.str.lower().str.split().map(set)
In [15]: %%timeit
...: df.Description.str.lower().str.split().map(set).to_frame('desc') \
...: .apply(lambda r: (df["Text_Search"] & r.desc).any(),
...: axis=1)
...:
1 loop, best of 3: 5.06 s per loop
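As written, the lambda intersects r.desc with the whole Text_Search column (df["Text_Search"]), so a row is flagged when its description shares a word with any employer in the frame; with a single employer in this sample that happens to equal the row-wise answer. A more direct sketch of the same set-intersection idea, done strictly per row (the column name Match_bool is illustrative), would be:

desc_sets = df['Description'].str.lower().str.split().map(set)
emp_sets = df['Employer'].str.lower().str.split().map(set)
# a non-empty intersection means at least one Employer word appears in the Description
df['Match_bool'] = [bool(d & e) for d, e in zip(desc_sets, emp_sets)]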
Solution 2: CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, lowercase=True)
In [8]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
10 loops, best of 3: 88.2 ms per loop
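The same caveat applies here: the vocabulary is built from the entire Employer column, so a description is flagged when it contains a word from any employer, not necessarily its own row's; with one employer in this sample the results coincide. Also, get_feature_names() was removed in scikit-learn 1.2; the equivalent call today is get_feature_names_out(), e.g.:

X = vect.fit_transform(df['Employer'])
cols_emp = vect.get_feature_names_out()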
Solution 3: df.apply(search_func, axis=1)
df["Text_Search"] = df["Employer"].str.lower().str.split()
In [12]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 362 ms per loop
NOTE: Solution 1 is too slow, so I will not time it for the bigger DFs.
Comparison of df.apply(search_func, axis=1) and CountVectorizer for a 20,000-row DF:
In [16]: df = pd.concat([df] * 10, ignore_index=True)
In [17]: df.shape
Out[17]: (20000, 6)
In [20]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 3.66 s per loop
In [21]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
1 loop, best of 3: 825 ms per loop
Comparison of df.apply(search_func, axis=1) and CountVectorizer for a 200,000-row DF:
In [22]: df = pd.concat([df] * 10, ignore_index=True)
In [23]: df.shape
Out[23]: (200000, 6)
In [24]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 36.8 s per loop
In [25]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
1 loop, best of 3: 8.28 s per loop
Conclusion: the CountVectorizer solution is approx. 4.44 times faster than df.apply(search_func, axis=1).