Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are there any algorithms to categorize an array among certain patterns?

For a simple problem of array length 5 to start with ( in practice the array length might be 20.. )

I have got a predefined set of patterns, like AAAAB, AAABA, BAABC, BCAAA, .... Each pattern is of the same length of the input array. I would need a function that takes any integer array as input, and returns all the patterns it matches. (an array may match a few patterns) as fast as possible.

"A" means that in the pattern all numbers at the positions of A are equal. E.g. AAAAA simply means all numbers are equal, {1, 1, 1, 1, 1} matches AAAAA.

"B" means the number at the positions B are not equal to the number at the position of A. (i.e. a wildcard for a number which is not A)Numbers represented by B don't have to be equal. E.g. ABBAA means the 1st, 4th, 5th numbers are equal to, say x, and 2nd, 3rd are not equal to x. {2, 3, 4, 2, 2} matches ABBAA.

"C" means this position can be any number (i.e. a wildcard for a number). {1, 2, 3, 5, 1} matches ACBBA, {1, 1, 3, 5, 1} also matches ACBBA

I am looking for an efficient ( in terms of comparisons number) algorithm. It doesn't have to be optimal, but shouldn't be too bad from optimal. I feel it is sort-of like the decision tree...

A very straightforward but inefficient way is like the following:

  • Try to match each pattern against the input. say AABCA against {a, b, c, d, e}. It checks if (a=b=e && a!=c).

  • If the number of patterns is n, the length of the pattern/array is m, then the complexity is about O(n*m)

Update:

Please feel free to suggest better wordings for the question, as I don't know how to make the question simple to understand without confusions.

An ideal algorithm would need some kind of preparation, like to transform the set of patterns into a decision tree. So that the complexities after preprocessing can be achieved to something like O(log n * log m) for some special pattern sets.(just a guess)

Some figures that maybe helpful: the predefined pattern sets is roughly of the size of 30. The number of input arrays to match with is about 10 millions.

Say, if AAAAA and AAAAC are both in the pre defined pattern set. Then if AAAAA matches, AAAAC matches as well. I am looking for an algorithm which could recognize that.

Update 2

@Gareth Rees 's answer gives a O(n) solution, but under assumption that there are not many "C"s. (otherwise the storage is huge and many unnecessary comparisons)

I would also welcome any ideas on how to deal with situations where there are many "C"s, say, for input array of length 20, there are at least 10 "C"s for each predefined patterns.

like image 545
colinfang Avatar asked Dec 15 '12 18:12

colinfang


1 Answers

Here's an idea that trades O(2n) preparation and storage for O(n)-ish runtime. If your arrays are no longer than your machine's word size (you imply that 20 would be a typical size), or if there are not too many occurrences of C in the patterns, this idea might work for you. (If neither of these conditions is satisfied, avoid!)

  1. (Preparatory step, done once.) Create a dictionary d mapping numbers to sets of patterns. For each pattern p, and each subset S of the occurrences of C in that pattern, let n be the number that has a set bit corresponding to each A in the pattern, and for each occurrence of C in S. Add p to the set of patterns d[n].

  2. (Remaining steps are done each time a new array needs to be matched against the patterns.) Create a dictionary e mapping numbers to numbers.

  3. Let j run over the indexes of the array, and for each j:

    1. Let i be the j-th integer in the array.

    2. If i is not in the dictionary e, set e[i] = 0.

    3. Set e[i] = e[i] + 2ℓ − j − 1 where ℓ is the length of the array.

  4. Now the keys of e are the distinct numbers i in the array, and the value e[i] has a set bit corresponding to each occurrence of i in the array. For each value e[i] that is found in the dictionary d, all the patterns in the set d[e[i]] match the array.

(Note: in practice you'd build the bitsets the other way round, and use 2j at step 3.3 instead of 2ℓ − j − 1, but I've described the algorithm this way for clarity of exposition.)

Here's an example. Suppose we have the patterns AABBA and ACBBA. In the preprocessing step AABBA turns into the number 25 (11001 in binary), and ACBBA turns into the numbers 25 (11001 in binary) and 17 (10001 in binary), for the two possible subsets of the occurrences of C in the pattern. So the dictionary d looks like this:

  • 17 → {ACBBA}
  • 25 → {AABBA, ACBBA}

After processing the array {1, 2, 3, 5, 1} we have e = {1 → 17, 2 → 8, 3 → 4, 5 → 2}. The value e[1] = 17 is found in d, so this input matches the pattern ACBBA.

After processing the array {1, 1, 2, 3, 1} we have e = {1 → 25, 2 → 4, 3 → 2}. The value e[1] = 25 is found in d, so this input matches the patterns AABBA and ACBBA.

like image 72
Gareth Rees Avatar answered Oct 15 '22 14:10

Gareth Rees