Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do you tell if two wildcards overlap?

Given two strings with * wildcards, I would like to know if a string could be created that would match both.

For example, these two are a simple case of overlap:

  1. Hello*World
  2. Hel*

But so are all of these:

  1. *.csv
  2. reports*.csv
  3. reportsdump.csv

Is there an algorithm published for doing this? Or perhaps a utility function in Windows or a library I might be able to call or copy?

like image 336
Tom Ritter Avatar asked May 12 '10 18:05

Tom Ritter


3 Answers

You can solve this in time linear in the sum of the pattern lengths:

If both strings start or end with non-wildcards, check that they match until one pattern hits a wildcard (otherwise they don't match). This reduces the problem to the case where at least one pattern starts with a wildcard and at least one pattern ends with a wildcard. If both patterns have wildcards (somewhere), then they have to match:

  • if p1 starts with a wildcard and p2 ends with a wildcard, use the p1 wildcard to eat up all of p2 up to its last wildcard, then use the p2 wildcard to eat up all of p1
  • if p1 starts and ends with a wildcard, then use its starting wildcard to eat up p2 up to its first wildcard, then use the p2 wildcard to eat up p1 to its last wildcard, then use the last p1 wildcard to eat up the rest of p2

Otherwise, one string (p1) has no wildcards, and the other string (p2) has strings s1,s2,...punctuated with wildcards. So just search for the first occurrence of s1 in p1, then for the first subsequent occurrence of s2 (starting from the end of the match in p1), etc. If you find all of the strings, then the patterns match, otherwise they don't.

like image 61
user2949652 Avatar answered Oct 17 '22 23:10

user2949652


For what it's worth, here's one implementation of the algorithm from sepp2k's answer in C# (I used explicit return true; and return false; calls, along with comments, for algorithm readability):

public static bool WildcardIntersect(string w1, string w2)
{
    // if both are empty or contain wildcards
    if ((string.IsNullOrEmpty(w1) || w1 == "*")
        && (string.IsNullOrEmpty(w2) || w2 == "*"))
        return true;

    // if either string is empty, return false
    // we can do this because we know the other string MUST be non-empty and non-wildcard
    if (string.IsNullOrEmpty(w1) || string.IsNullOrEmpty(w2))
        return false;

    char c1 = w1[0], // first character of wildcard string 1
         c2 = w2[0]; // first character of wildcard string 2
    string remain1 = w1.Substring(1), // remaining of wildcard string 1
           remain2 = w2.Substring(1); // remaining of wildcard string 2

    // if first letters match and remaining intersect
    if ((c1 == c2 && WildcardIntersect(remain1, remain2))
        // if either is a wildcard and either remaining intersects with the other whole
        || ((c1 == '*' || c2 == '*') && (WildcardIntersect(w1, remain2) || WildcardIntersect(remain1, w2))))
        return true;

    // else, no match, return false
    return false;
}
like image 39
kdawg Avatar answered Oct 17 '22 21:10

kdawg


Since every glob can be written as a regular expression and the intersection of two regular expressions can be found (unless they aren't really regular, but they would be in this case), you can find the intersection of two globs by transforming them into regular expressions and then finding the intersection of those. So you can find out whether two globs intersect by finding the intersection of the regular expressions and checking whether it's empty.

However since globs are more limited than regular expression, there is a much easier way:

Let's call the two globs g1 and g2. They intersect iff

  1. Both g1 and g2 are empty or only contain wildcards.
  2. Neither g1 nor g2 are empty and one of the following conditions is true (let c1 be the first character of g1 and t1 the string containing the remaining characters - same for g2 with c2 and t2):
    1. c1 and c2 are equal and t1 and t2 intersect
    2. c1 and/or c2 is a wildcard and t1 intersects with g2
    3. c1 and/or c2 is a wildcard and g1 intersects with t2

An example implementation in haskell:

intersect g1          []          = all (== '*') g1
intersect []          g2          = all (== '*') g2
intersect g1@('*':t1) g2@(c2:t2)  = intersect g1 t2 || intersect t1 g2
intersect g1@(c1:t1)  g2@('*':t2) = intersect t1 g2 || intersect g1 t2
intersect    (c1:t1)     (c2:t2)  = c1 == c2        && intersect t1 t2

This algorithm isn't particular efficient if the globs contain a lot of wildcards, but it's very easy to implement and since you're likely planning to use this with filenames, I doubt you'll have globs longer than 1000 chars.

like image 8
sepp2k Avatar answered Oct 17 '22 22:10

sepp2k