I have 2 dataframes: one (A) with some whitelist hostnames in regex form (ie (.*)microsoft.com, (.*)go.microsoft.com ...) and another (B) with actual full hostnames of sites. I want to add a new column to this 2nd dataframe with the regex text of the whitelist (1st) dataframe. However, it appears that Pandas' .replace() method doesn't care about what order items are in for its to_replace and value args.
My data looks like this:
In [1] A
Out[1]:
wildcards \
42 (.*)activation.playready.microsoft.com
35 (.*)v10.vortex-win.data.microsoft.com
40 (.*)settings-win.data.microsoft.com
43 (.*)smartscreen.microsoft.com
39 (.*).playready.microsoft.com
38 (.*)go.microsoft.com
240 (.*)i.microsoft.com
238 (.*)microsoft.com
regex
42 re.compile('^(.*)activation.playready.microsof...
35 re.compile('^(.*)v10.vortex-win.data.microsoft...
40 re.compile('^(.*)settings-win.data.microsoft.c...
43 re.compile('^(.*)smartscreen.microsoft.com$')
39 re.compile('^(.*).playready.microsoft.com$')
38 re.compile('^(.*)go.microsoft.com$')
240 re.compile('^(.*)i.microsoft.com$')
238 re.compile('^(.*)microsoft.com$')
In [2] B.head()
Out[2]:
server_hostname
146 mobile.pipe.aria.microsoft.com
205 settings-win.data.microsoft.com
341 nav.smartscreen.microsoft.com
406 v10.vortex-win.data.microsoft.com
667 www.microsoft.com
Notice that A has a column of compiled regexes in a similar form to the wildcards column. I want to add a wildcards column to B like this:

B.loc[:,'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(), A['wildcards'].tolist())

But the problem is, all of B's wildcards values become (.*)microsoft.com. This happens no matter the order of A's wildcard values. It appears .replace() applies the to_replace regexes from the shortest value first rather than in the order provided.
How can I provide a list of to_replace values so that I ultimately get the most detailed hostname wildcards value associated with B's server_hostname values?
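For anyone who wants to reproduce this, here is a cut-down reconstruction of the two frames (a few rows copied from the output above; the construction itself is just my abbreviation for illustration) together with the call in question:

import re
import pandas as pd

# Cut-down reconstruction of A and B (values taken from the output above)
wildcards = ['(.*)smartscreen.microsoft.com',
             '(.*)settings-win.data.microsoft.com',
             '(.*)microsoft.com']
A = pd.DataFrame({'wildcards': wildcards,
                  'regex': [re.compile('^' + w + '$') for w in wildcards]})
B = pd.DataFrame({'server_hostname': ['nav.smartscreen.microsoft.com',
                                      'settings-win.data.microsoft.com',
                                      'www.microsoft.com']})

# The call from the question
B.loc[:, 'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(),
                                                     A['wildcards'].tolist())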
Most answers use apply(), which is known to be slower than built-in vectorized solutions. My hope in using .replace() was that it would be fast since it is such a built-in vector function. @vlemaistre's answer was the only one not to use .apply(), as is my solution here, which, instead of compiling each wildcard into a regex, treats it as a right-hand substring and uses the logic: "if server_hostname ends with wildcard, then it's a match". So long as I sort my wildcards by length, it works just fine.
My function which does this is:
def match_to_whitelist(accepts_df, whitelist_df):
    """ Adds a 'wildcards' column to accepts_df showing which (if any) whitelist entry it matches """
    accepts_df.loc[:, 'wildcards'] = None
    # whitelist_df['wildcards'] is assumed to be sorted longest-first, so the
    # most specific entry claims a row before a shorter suffix can
    for wildcard in whitelist_df['wildcards']:
        accepts_df.loc[(accepts_df['wildcards'].isnull()) &
                       (accepts_df['server_hostname'].str.endswith(wildcard)), 'wildcards'] = wildcard
    rows_matched = accepts_df['wildcards'].notnull().sum()
    print(f"Rows matched: {rows_matched}")
    return accepts_df
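For context, a hypothetical call, with the strip-and-sort preprocessing spelled out (this preprocessing is my own shorthand for the two differences described next, not part of the function itself):

import pandas as pd

# Hypothetical usage; A and B are the frames from the question
whitelist_df = pd.DataFrame({
    # drop the leading "(.*)" so each entry is a plain right-hand substring
    'wildcards': A['wildcards'].str.replace(r'^\(\.\*\)', '', regex=True)
})
# longest first, so the most specific whitelist entry is tried first
whitelist_df = whitelist_df.sort_values('wildcards',
                                        key=lambda s: s.str.len(),
                                        ascending=False)
accepts_df = match_to_whitelist(B.copy(), whitelist_df)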
Here, accepts_df is like B from before, and whitelist_df is like A from before, but with 2 differences:

- there is no regex column
- the wildcards values are no longer in glob/regex format (ie "(.*)microsoft.com" becomes "microsoft.com")

To benchmark answers on my machine, I'll use mine as a baseline, taking 27 secs to process 100k accepts_df rows with 400 whitelist_df rows. Using the same dataset, here are the times for the other solutions (I was lazy: if they didn't run right out of the gate, I didn't debug much to find out why):
- (after A is sorted): over 6 mins, so quit...
- lastgroup: didn't test, because I can't (quickly) build a long RE programmatically (a rough sketch of one way to do that is below).
- Series.str.match(): Error: "pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects"

Ultimately, none of our answers say how to use .replace() as desired, so for the time being I'll leave this unanswered for a few weeks in case someone can provide an answer that better uses .replace(), or at least some other fast vector-based solution. Until then, I'll keep with what I have, or maybe use aws_apprentice's after I verify results.
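For the record, a rough, untested sketch of how such a combined RE could be built (this is an assumption on my part, reusing the stripped wildcards column from above, and I haven't benchmarked it):

import re

# Assumption: build one anchored pattern from the stripped wildcards and let
# str.extract() pull out the matching suffix in a single pass. The lazy .*?
# makes the match start as early as possible, i.e. the longest suffix wins.
pattern = '^.*?(' + '|'.join(re.escape(w) for w in whitelist_df['wildcards']) + ')$'
accepts_df['wildcards'] = accepts_df['server_hostname'].str.extract(pattern, expand=False)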
EDIT
I improved my matcher by adding a "domain" column to both DFs, consisting of the last 2 parts of each wildcard/server_hostname (ie www.microsoft.com becomes "microsoft.com"). I then used groupby('domain') on both DFs, iterated through the whitelist groups of domains, fetched the same domain group from the server_hostname DF (B), and did my matching using just the subset of wildcards/server_hostnames from each group. This cut my matching time in half.
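A minimal sketch of that idea (illustrative only, not the exact code I ran; it reuses the column names from above):

import pandas as pd

def add_domain(hostnames):
    # "www.microsoft.com" -> "microsoft.com" (last two dot-separated labels)
    return hostnames.str.split('.').str[-2:].str.join('.')

def match_by_domain(accepts_df, whitelist_df):
    accepts_df = accepts_df.copy()
    accepts_df['domain'] = add_domain(accepts_df['server_hostname'])
    whitelist_df = whitelist_df.assign(domain=add_domain(whitelist_df['wildcards']))
    accepts_df['wildcards'] = None
    wl_groups = {domain: grp for domain, grp in whitelist_df.groupby('domain')}
    for domain, grp in accepts_df.groupby('domain'):
        wl = wl_groups.get(domain)
        if wl is None:
            continue  # no whitelist entries share this domain
        # longest wildcard first, so the most specific entry wins
        for wildcard in sorted(wl['wildcards'], key=len, reverse=True):
            mask = (grp['server_hostname'].str.endswith(wildcard) &
                    accepts_df.loc[grp.index, 'wildcards'].isnull())
            accepts_df.loc[grp.index[mask], 'wildcards'] = wildcard
    return accepts_df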