Pandas Dataframe check if a value exists using regex

I have a big dataframe and I want to check if any cell contains the string "admin".

   col1                   col2 ... coln
0   323           roster_admin ... rota_user
1   542  assignment_rule_admin ... application_admin
2   123           contact_user ... configuration_manager
3   235         admin_incident ... incident_user
... ...  ...                   ... ...

I tried to use df.isin(['*admin*']).any() but it seems that isin doesn't support regex. How can I search through all columns using a regex?

I have avoided using loops because the dataframe contains over 10 million rows and many columns, so efficiency is important to me.

asked Jul 04 '18 by Michael Dz

2 Answers

There are two solutions:

  1. The df.col.apply method is more straightforward but also slightly slower:

    In [1]: import pandas as pd
    
    In [2]: import re
    
    In [3]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})
    
    In [4]: df
    Out[4]: 
       col1       col2
    0     1      admin
    1     2         aa
    2     3         bb
    3     4  c_admin_d
    4     5   ee_admin
    
    In [5]: r = re.compile(r'.*(admin).*')
    
    In [6]: df.col2.apply(lambda x: bool(r.match(x)))
    Out[6]: 
    0     True
    1    False
    2    False
    3     True
    4     True
    Name: col2, dtype: bool
    
    In [7]: %timeit -n 100000 df.col2.apply(lambda x: bool(r.match(x)))
    167 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

  2. The np.vectorize method requires importing numpy, but it's more efficient (about 4 times faster in my timeit test).

    In [1]: import numpy as np
    
    In [2]: import pandas as pd
    
    In [3]: import re
    
    In [4]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})
    
    In [5]: df
    Out[5]: 
       col1       col2
    0     1      admin
    1     2         aa
    2     3         bb
    3     4  c_admin_d
    4     5   ee_admin
    
    In [6]: r = re.compile(r'.*(admin).*')
    
    In [7]: regmatch = np.vectorize(lambda x: bool(r.match(x)))
    
    In [8]: regmatch(df.col2.values)
    Out[8]: array([ True, False, False,  True,  True])
    
    In [9]: %timeit -n 100000 regmatch(df.col2.values)
    43.4 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
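As a side note (my own addition, not part of the answer above): pandas also ships a vectorized regex matcher, Series.str.contains, which accepts a regular expression directly and avoids the Python-level lambda entirely. A minimal sketch on the same example frame:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})

# str.contains takes a regex by default; na=False guards against NaN cells.
mask = df.col2.str.contains('admin', regex=True, na=False)
print(mask.tolist())  # [True, False, False, True, True]
```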

Since you have changed your question to check any cell, and are also concerned about time efficiency:

    # if you want to check all columns no matter what `dtypes` they are
    dfs = df.astype(str, copy=True, errors='raise')
    regmatch(dfs.values)        # This will return a 2-d array of booleans
    regmatch(dfs.values).any()  # For existence.

You can still use the df.applymap method, but again, it will be slower.

    dfs = df.astype(str, copy=True, errors='raise')
    r = re.compile(r'.*(admin).*')
    dfs.applymap(lambda x: bool(r.match(x)))              # This will return a dataframe of booleans.
    dfs.applymap(lambda x: bool(r.match(x))).any().any()  # For existence.
answered Nov 17 '22 by YaOzI

Try this:

    import pandas as pd

    df = pd.DataFrame(
        {'col1': [323, 542, 123, 235],
         'col2': ['roster_admin', 'assignment_rule_admin', 'contact_user', 'admin_incident'],
        })

    df.apply(lambda row: row.astype(str).str.contains('admin').any(), axis=1)

Output:

    0     True
    1     True
    2    False
    3     True
    dtype: bool
answered Nov 17 '22 by min2bro