I want to compare two columns: Description and Employer. I want to see if any keywords from Employer are found in the Description column. I have broken the Employer column down into words and converted them to a list. Now I want to see if any of those words are in the corresponding Description column.
Sample input:
print(df.head(25))
Date Description Amount AutoNumber \
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246
Employer
0 Cansel Survey Equipment
2 Cansel Survey Equipment
5 Cansel Survey Equipment
9 Cansel Survey Equipment
10 Cansel Survey Equipment
11 Cansel Survey Equipment
12 Cansel Survey Equipment
13 Cansel Survey Equipment
14 Cansel Survey Equipment
15 Cansel Survey Equipment
17 Cansel Survey Equipment
19 Cansel Survey Equipment
20 Cansel Survey Equipment
21 Cansel Survey Equipment
22 Cansel Survey Equipment
23 Cansel Survey Equipment
24 Cansel Survey Equipment
26 Cansel Survey Equipment
27 Cansel Survey Equipment
28 Cansel Survey Equipment
I tried something like this, but it doesn't seem to work:
df['Text_Search'] = df['Employer'].apply(lambda x: x.split(" "))
df['Match'] = np.where(df['Description'].str.contains("|".join(df['Text_Search'])), "Yes", "No")
My desired output would be as shown below:
Date Description Amount AutoNumber \
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246
29 12/30/2014 WZ553 TFR?FR xxx8690 200.00 49246
32 12/29/2014 JW173 TFR?FR xxx8690 300.00 49246
33 12/24/2014 CANSEL SURVEY E PAY 1219.21 49246
34 12/24/2014 CANSEL SURVEY E PAY 434.84 49246
36 12/23/2014 WT002 TFR?FR xxx8690 160.00 49246
Employer Text_Search Match
0 Cansel Survey Equipment [Cansel, Survey, Equipment] No
2 Cansel Survey Equipment [Cansel, Survey, Equipment] No
5 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
9 Cansel Survey Equipment [Cansel, Survey, Equipment] No
10 Cansel Survey Equipment [Cansel, Survey, Equipment] No
11 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
12 Cansel Survey Equipment [Cansel, Survey, Equipment] No
13 Cansel Survey Equipment [Cansel, Survey, Equipment] No
14 Cansel Survey Equipment [Cansel, Survey, Equipment] No
15 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
17 Cansel Survey Equipment [Cansel, Survey, Equipment] No
19 Cansel Survey Equipment [Cansel, Survey, Equipment] No
20 Cansel Survey Equipment [Cansel, Survey, Equipment] No
21 Cansel Survey Equipment [Cansel, Survey, Equipment] No
22 Cansel Survey Equipment [Cansel, Survey, Equipment] No
23 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
24 Cansel Survey Equipment [Cansel, Survey, Equipment] No
26 Cansel Survey Equipment [Cansel, Survey, Equipment] No
27 Cansel Survey Equipment [Cansel, Survey, Equipment] No
28 Cansel Survey Equipment [Cansel, Survey, Equipment] No
29 Cansel Survey Equipment [Cansel, Survey, Equipment] No
32 Cansel Survey Equipment [Cansel, Survey, Equipment] No
33 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
34 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
36 Cansel Survey Equipment [Cansel, Survey, Equipment] No
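For reference, the attempt above fails because df['Text_Search'] holds lists, so "|".join(df['Text_Search']) raises a TypeError (join expects strings); even with strings, one joined pattern would mix every row's keywords instead of matching row by row, and str.contains is case-sensitive by default, so "Cansel" would not match "CANSEL". A minimal per-row sketch that reproduces the Match column above, assuming the intended rule is case-insensitive substring matching of each Employer word against that row's Description, could look like this:

import numpy as np

df['Text_Search'] = df['Employer'].apply(lambda x: x.split())
df['Match'] = np.where(
    [any(word.lower() in desc.lower() for word in words)
     for desc, words in zip(df['Description'], df['Text_Search'])],
    'Yes', 'No')

The list comprehension walks the two columns in parallel, so each Description is checked only against its own row's Employer words.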
Let's prepare a bit bigger DF of 2,000 rows:
In [3]: df = pd.concat([df] * 10**2, ignore_index=True)
In [4]: df.shape
Out[4]: (2000, 5)
Solution 1: df.apply(..., axis=1)
df["Text_Search"] = df.Employer.str.lower().str.split().map(set)
In [15]: %%timeit
...: df.Description.str.lower().str.split().map(set).to_frame('desc') \
...: .apply(lambda r: (df["Text_Search"] & r.desc).any(),
...: axis=1)
...:
1 loop, best of 3: 5.06 s per loop
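As written, the lambda intersects r.desc with the whole Text_Search column (df["Text_Search"]), so a row is flagged when its description shares a word with any employer in the frame; with a single employer in this sample that happens to equal the row-wise answer. A more direct sketch of the same set-intersection idea, done strictly per row (the column name Match_bool is illustrative), would be:

desc_sets = df['Description'].str.lower().str.split().map(set)
emp_sets = df['Employer'].str.lower().str.split().map(set)
# a non-empty intersection means at least one Employer word appears in the Description
df['Match_bool'] = [bool(d & e) for d, e in zip(desc_sets, emp_sets)]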
Solution 2: CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, lowercase=True)
In [8]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
10 loops, best of 3: 88.2 ms per loop
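The same caveat applies here: the vocabulary is built from the entire Employer column, so a description is flagged when it contains a word from any employer, not necessarily its own row's; with one employer in this sample the results coincide. Also, get_feature_names() was removed in scikit-learn 1.2; the equivalent call today is get_feature_names_out(), e.g.:

X = vect.fit_transform(df['Employer'])
cols_emp = vect.get_feature_names_out()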
Solution 3: df.apply(search_func, axis=1)
df["Text_Search"] = df["Employer"].str.lower().str.split()
In [12]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 362 ms per loop
NOTE: Solution 1 is too slow, so I will not time it for the bigger DFs.
Comparison of df.apply(search_func, axis=1) and CountVectorizer for a 20,000-row DF:
In [16]: df = pd.concat([df] * 10, ignore_index=True)
In [17]: df.shape
Out[17]: (20000, 6)
In [20]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 3.66 s per loop
In [21]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
1 loop, best of 3: 825 ms per loop
Comparison of df.apply(search_func, axis=1) and CountVectorizer for a 200,000-row DF:
In [22]: df = pd.concat([df] * 10, ignore_index=True)
In [23]: df.shape
Out[23]: (200000, 6)
In [24]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 36.8 s per loop
In [25]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
1 loop, best of 3: 8.28 s per loop
Conclusion: the CountVectorizer solution is approx. 4.44 times faster than df.apply(search_func, axis=1).