Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient string matching algorithm

Tags:

c#

algorithm

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.

Here are my requirements:

  • Given a domain name, i.e. www.example.com, determine if it "matches" one in a list of entries.
  • Entries may be absolute matches, i.e. www.example.com.
  • Entries may include wildcards, i.e. *.example.com.
  • Wildcard entries match from the most-defined level and up. For example, *.example.com would match www.example.com, example.com, and sub.www.example.com.
  • Wildcard entries are not embedded, i.e. sub.*.example.com will not be an entry.

Language/environment: C# (.Net Framework 3.5)

I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.

I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.

My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?

like image 956
jro Avatar asked Nov 28 '22 10:11

jro


2 Answers

If you're looking to roll your own, I would store the entries in a tree structure. See my answer to another SO question about spell checkers to see what I mean.

Rather than tokenize the structure by "." characters, I would just treat each entry as a full string. Any tokenized implementation would still have to do string matching on the full set of characters anyway, so you may as well do it all in one shot.

The only differences between this and a regular spell-checking tree are:

  1. The matching needs to be done in reverse
  2. You have to take into account the wildcards

To address point #2, you would simply check for the "*" character at the end of a test.

A quick example:

Entries:

*.fark.com
www.cnn.com

Tree:

m -> o -> c -> . -> k -> r -> a -> f -> . -> *
                \
                 -> n -> n -> c -> . -> w -> w -> w

Checking www.blog.fark.com would involve tracing through the tree up to the first "*". Because the traversal ended on a "*", there is a match.

Checking www.cern.com would fail on the second "n" of n,n,c,...

Checking dev.www.cnn.com would also fail, since the traversal ends on a character other than "*".

like image 74
e.James Avatar answered Dec 21 '22 06:12

e.James


I would use Regex, just make sure to have it the expression compiled once (instead of it being calculated again and again).

like image 22
eglasius Avatar answered Dec 21 '22 06:12

eglasius