
How to use pandas .replace() with list of regexs while honoring list order?

I have 2 dataframes: one (A) with whitelisted hostnames in regex form (i.e. (.*)microsoft.com, (.*)go.microsoft.com, ...) and another (B) with actual full hostnames of sites. I want to add a new column to this 2nd dataframe containing the matching wildcard text from the whitelist (1st) dataframe. However, it appears that pandas' .replace() method doesn't care about the order of items in its to_replace and value args.

My data looks like this:

In [1] A
Out[1]: 
                                  wildcards  \
42   (.*)activation.playready.microsoft.com   
35    (.*)v10.vortex-win.data.microsoft.com   
40      (.*)settings-win.data.microsoft.com   
43            (.*)smartscreen.microsoft.com   
39             (.*).playready.microsoft.com   
38                     (.*)go.microsoft.com   
240                     (.*)i.microsoft.com   
238                       (.*)microsoft.com   
                                                 regex  
42   re.compile('^(.*)activation.playready.microsof...  
35   re.compile('^(.*)v10.vortex-win.data.microsoft...  
40   re.compile('^(.*)settings-win.data.microsoft.c...  
43       re.compile('^(.*)smartscreen.microsoft.com$')  
39        re.compile('^(.*).playready.microsoft.com$')  
38                re.compile('^(.*)go.microsoft.com$')  
240                re.compile('^(.*)i.microsoft.com$')  
238                  re.compile('^(.*)microsoft.com$')  


In [2] B.head()
Out[2]: 
                       server_hostname
146     mobile.pipe.aria.microsoft.com
205    settings-win.data.microsoft.com
341      nav.smartscreen.microsoft.com
406  v10.vortex-win.data.microsoft.com
667                  www.microsoft.com

Notice that A has a regex column of compiled regexes corresponding to the wildcards column. I want to add a wildcards column to B like this:

B.loc[:,'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(), A['wildcards'].tolist())

But the problem is, all of B's wildcards values become (.*)microsoft.com. This happens no matter the order of A's wildcard values. It appears .replace() applies the to_replace regexes shortest-first rather than in the order provided.
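Here is a stripped-down sketch of the behavior (toy stand-ins for A and B, same style of call as above):

import re
import pandas as pd

# Toy data just to illustrate the ordering problem
wildcards = ['(.*)settings-win.data.microsoft.com', '(.*)microsoft.com']
regexes = [re.compile('^(.*)settings-win.data.microsoft.com$'),
           re.compile('^(.*)microsoft.com$')]
hostnames = pd.Series(['settings-win.data.microsoft.com', 'www.microsoft.com'])

# The more specific pattern is listed first, yet every value still comes back
# as '(.*)microsoft.com', regardless of how the lists are ordered.
print(hostnames.replace(regexes, wildcards))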

How can I provide a list of to_replace values so that I ultimately get the most detailed hostname wildcard value associated with each of B's server_hostname values?

asked May 03 '19 by hamx0r

1 Answer

Most answers use apply(), which is known to be slower than built-in vectorized solutions. My hope in using .replace() was that it would be fast, since it is such a built-in vectorized function. @vlemaistre's answer was the only other one not to use .apply(). My solution below, instead of compiling each wildcard into a regex, treats it as a right-hand substring and uses the logic: "if server_hostname ends with wildcard, then it's a match". So long as I sort my wildcards by length, it works just fine.

My function which does this is:

def match_to_whitelist(accepts_df, whitelist_df):
    """ Adds `wildcards` column to accepts_df showing which (if any) whitelist entry it matches with """
    accepts_df.loc[:, 'wildcards'] = None
    for wildcard in whitelist_df['wildcards']:
        # Only fill rows that haven't already matched a (more specific) wildcard earlier in the loop
        accepts_df.loc[(accepts_df['wildcards'].isnull()) & (
            accepts_df['server_hostname'].str.endswith(wildcard)), 'wildcards'] = wildcard
    rows_matched = accepts_df['wildcards'].notnull().sum()
    print(f"Rows matched: {rows_matched}")
    return accepts_df

Here, accepts_df is like B from before, and whitelist_df is like A before, but with 2 differences:

  1. no regex column
  2. the wildcards values are no longer in glob/regex format (i.e. "(.*)microsoft.com" becomes "microsoft.com"); see the sketch below for how I build it
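
For reference, here is a rough sketch of how I build whitelist_df from A and call the function (the "(.*)" stripping and the longest-first sort are the important parts; column names are as shown above):

whitelist_df = A.drop(columns=['regex']).copy()
# Strip the leading "(.*)" so each wildcard is a plain right-hand substring
whitelist_df['wildcards'] = whitelist_df['wildcards'].str.replace(r'^\(\.\*\)', '', regex=True)
# Sort longest-first so the most specific wildcard is tried (and wins) first
whitelist_df['_len'] = whitelist_df['wildcards'].str.len()
whitelist_df = whitelist_df.sort_values('_len', ascending=False).drop(columns=['_len'])

accepts_df = match_to_whitelist(B.copy(), whitelist_df)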

To benchmark answers on my machine, I'll use mine as a baseline: it takes 27 seconds to process 100k accepts_df rows against 400 whitelist_df rows. Using the same dataset, here are the times for the other solutions (I was lazy: if they didn't run out of the gate, I didn't debug much to find out why):

  • @vlemaistre - List Comprehension with vector functions: 193sec
  • @user214 - SequenceMatcher: 234sec
  • @aws_apprentice - Compare RE search result lengths: 24sec
  • @fpersyn - First match (will be the best match if A is sorted): over 6 minutes, so I quit...
  • @Andy Hayden - lastgroup: didn't test, because I can't (quickly) build a long RE programmatically.
  • @capelastegui - Series.str.match(): Error: "pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects"

Ultimately, none of our answers shows how to use .replace() as desired, so for the time being I'll leave this unanswered for a few weeks in case someone can provide a better way to use .replace(), or at least some other fast vector-based solution. Until then, I'll stick with what I have, or maybe use @aws_apprentice's approach after I verify its results.

EDIT: I improved my matcher by adding a "domain" column to both DFs, consisting of the last 2 parts of each wildcard/server_hostname (i.e. www.microsoft.com becomes "microsoft.com"). I then used groupby('domain') on both DFs, iterated through the whitelist's domain groups, fetched the same domain group from the server_hostname DF (B), and did my matching using only the subset of wildcards/server_hostnames from each group. This cut my matching time in half.
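
A sketch of that grouped version (the domain extraction is a simplification that just takes the last two dot-separated labels, and it assumes whitelist_df is already sorted longest-first as above):

def match_by_domain(accepts_df, whitelist_df):
    """ Like match_to_whitelist, but only compares hostnames against wildcards sharing the same domain """
    # Last two dot-separated labels, e.g. "www.microsoft.com" -> "microsoft.com"
    accepts_df['domain'] = accepts_df['server_hostname'].str.extract(r'([^.]+\.[^.]+)$', expand=False)
    whitelist_df['domain'] = whitelist_df['wildcards'].str.extract(r'([^.]+\.[^.]+)$', expand=False)
    accepts_df['wildcards'] = None
    for domain, wl_group in whitelist_df.groupby('domain'):
        in_domain = accepts_df['domain'] == domain
        for wildcard in wl_group['wildcards']:
            # Same "ends with" rule as before, restricted to hostnames in this domain
            accepts_df.loc[in_domain & accepts_df['wildcards'].isnull() &
                           accepts_df['server_hostname'].str.endswith(wildcard), 'wildcards'] = wildcard
    return accepts_df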

answered Nov 15 '22 by hamx0r