Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pattern in continuous sequence data

Suppose I have a list of events. For example A, D, T, H, U, A, B, F, H, ....

What I need is to find frequent patterns that occur in the complete sequence. In this problem we cannot use traditional algorithms like a priori or fp growth because they require separate item sets. And, I cannot break this stream into smaller sets.

Any idea which algorithm would work for me?


EDIT

For example, for the sequence A, D, T, H, U, A, D, T, H, T, H, U, A, H, T, H and with min_support = 2.

The frequent patterns will be

Of length 1 --> [A, D, T, H, U]
Of length 2 --> [AD, DT, TH, HU, UA, HT]
Of length 3 --> [ADT, DTH, THU, HUA]
Of length 4 --> [ADTH, THUA]
No sequences of length 5 and further
like image 239
Haris Avatar asked Oct 18 '15 08:10

Haris


1 Answers

You can try aho-corasick algorithm with a wildcard and/or just with all substrings. Aho-corasick is basically a finite state machine it needs a dictionary but then it find multiple pattern in the search string very fast. You can build a finite state machine with a trie and a breadth-first search. Here is nice example with animation:http://blog.ivank.net/aho-corasick-algorithm-in-as3.html. So you need basically 2 steps: build the finite state machine and search the string.

like image 58
Micromega Avatar answered Sep 20 '22 15:09

Micromega