Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Checking collision in filename search patterns with wildcards

I need to compare file system wildcard expressions to see whether their results would overlap, by only examining/comparing the expressions.

For sake of example, we are building a utility that would sort files from one (or more locations) into a separate folders based on file system wildcard expressions. For example: *.txt goes into folder a, *.doc goes into folder b, and so on. The wildcard characters we would support would be * and ?

I want to be able to determine from just analyzing the wildcard expressions, whether they would conflict/overlap.

For example, if I have the following expressions:

*.x.y
*.y

They would conflict (overlap) because the second expression *.y would include *.x.y results. (ex. A.x.y would match both expressions)

I am approaching this by building a tree structure using the using all of the expressions, figuring that the very act of building a tree will fail if the expressions conflict.

For example:
*.x
a.b
a.c
b.d

might create a tree like 

         +-*-.-x
         |
start +--+
         |     +-b
         |     |
         +-a-.-+-c
         |
         |
         +-b-.-d

If I try to add the pattern b.x, the tree would be successful following the *.x path, and thereby say that the pattern already exists.

Am I heading in the correct direction? Or is there an known algorithm for attacking this?

like image 996
Rick Hodder Avatar asked Nov 30 '15 23:11

Rick Hodder


People also ask

How do you use a wildcard in a filename?

When you have a number of files named in series (for example, chap1 to chap12) or filenames with common characters (like aegis, aeon, and aerie), you can use wildcards (also called metacharacters) to specify many files at once. These special characters are * (asterisk), ? (question mark), and [ ] (square brackets).

Is a wildcard character used to search for a file?

Wildcards are used in search terms to represent one or more other characters. The two most commonly used wildcards in our library databases are: An asterisk (*) may be used to specify any number of characters.

How do you use wildcards to search files and folders?

You can use your own wildcards to limit search results. You can use a question mark (?) as a wildcard for a single character and an asterisk (*) as a wildcard for any number of characters. For example, *. pdf would return only files with the PDF extension.

What is filename pattern?

You can specify a file name pattern, using wildcard characters, to identify a file to be read by the FileInput, CDInput, and FTEInput nodes. You can also specify a file name pattern, using a single wildcard character, to name the file to be created by the FileOutput and FTEOutput nodes.


1 Answers

To check whether two wildcard patterns could match the same filename, you can look at this problem as creating a grid of comparisons between pairs of characters, and then checking whether there exists a diagonal path. The illustration below shows how wildcard patterns ab?.c?? and a*bc.* can be checked for possible conflict:

wildcard conflict animation

When a match between two identical literal characters is found, you move diagonally to the next characters to check. (indicated with green arrow)

When a literal character and a single-character wild card ? are encountered, there are two possibilities: either the wild card matches the character (move diagonally), or the wildcard matches empty space, and you skip over it. (indicated with purple arrows)

When a multi-character wild card * is encountered, three possibilities need to be taken into account: the wild card matches an empty space before the other character, the wild card matches the other character, or the wild card matches multiple characters. (indicated with blue arrows)

wildcard conflict comparisons

code example 1 (iterative)

Below is a simple javascript implementation which iterates over the grid diagonally, marks cells which can be reached from the current cell, and then checks whether the bottom right cell is reachable. Run the code snippet to see a few examples. (update: top-to-bottom left-to-right will do fine instead of diagonally)

function wildConflict(wild1, wild2) {
    var grid = [[true]], width = wild1.length, height = wild2.length;
    for (var x = 1; x <= width; x++) grid[x] = [];
    for (var y = 0; y < height; y++) {
        for (var x = 0; x < width; x++) {
            if (grid[x][y]) {
                var a = wild1.charAt(x), b = wild2.charAt(y);
                if (a == '*' || b == '*' || a == '?' || b == '?' || a == b) grid[x + 1][y + 1] = true;
                if (a == '*' || b == '*' || a == '?') grid[x + 1][y] = true;
                if (a == '*' || b == '*' || b == '?') grid[x][y + 1] = true;
            }
        }
    }
    return grid[width][height] == true;
}

var a = ["a", "a", "a*", "abc", "a*", "*.x.y", "*.x.y", "a*b*", "a*bc.*", "a?c.de"];
var b = ["a", "b", "abc", "a?", "*b", "*.y", "*.x", "a*c*", "ab?.c??", "ac.d??"];
for (var i in a) document.write("&quot;" + a[i] + "&quot; &harr; &quot;" + b[i] + "&quot; &rarr; " + wildConflict(a[i], b[i]) + "<BR>");

code example 2 (recursive)

A simple recursive implementation has the drawback of potentially checking some character pairs more than once. It doesn't need the 2D-array, but the recursions obviously use memory too.

Note that when a multi-character wild card *is encountered, the algorithm recurses with only two possibilities: jump over the one character, or jump over the other character; jumping over both characters (i.e. the wild card matches exactly one character) is taken care of in the next step, when the wild card is compared to the next character.

function wildConflict(wild1, wild2) {
    var w1 = wild1.split(''), w2 = wild2.split('');
    return conflict(0, 0);

    function conflict(p1, p2) {
        if (p1 == w1.length || p2 == w2.length) {
            if ((p1 == w1.length && p2 == w2.length)
            || (p1 == w1.length - 1 && (w1[p1] == '*' || w1[p1] == '?'))
            || (p2 == w2.length - 1 && (w2[p2] == '*' || w2[p2] == '?'))) {
                return true;
            }
            else return false;                            // premature end
        }
        else if (w1[p1] == '*' || w2[p2] == '*' || (w1[p1] == '?' && w2[p2] == '?')) {
            return conflict(p1 + 1, p2) || conflict(p1, p2 + 1);
        }
        else if (w1[p1] == '?') {
            return conflict(p1 + 1, p2) || conflict(p1 + 1, p2 + 1);
        }
        else if (w2[p2] == '?') {
            return conflict(p1, p2 + 1) || conflict(p1 + 1, p2 + 1);
        }
        else if (w1[p1] == w2[p2]) {
            return conflict(p1 + 1, p2 + 1);
        }
        else return false;                               // unequal literals
    }
}

var x = ["a", "a", "a*", "abc", "a*", "*.x.y", "*.x.y", "a*b*", "a*bc.*", "a?c.de"];
var y = ["a", "b", "abc", "a?", "*b", "*.y", "*.x", "a*c*", "ab?.c??", "ac.d??"];
for (var i in x) document.write("&quot;" + x[i] + "&quot; &harr; &quot;" + y[i] + "&quot; &rarr; " + wildConflict(x[i], y[i]) + "<BR>");
like image 53