Pattern in continuous sequence data

Question

Suppose I have a list of events. For example A, D, T, H, U, A, B, F, H, ....

What I need is to find frequent patterns that occur in the complete sequence. In this problem we cannot use traditional algorithms like a priori or fp growth because they require separate item sets. And, I cannot break this stream into smaller sets.

Any idea which algorithm would work for me?

EDIT

For example, for the sequence A, D, T, H, U, A, D, T, H, T, H, U, A, H, T, H and with min_support = 2.

The frequent patterns will be

Of length 1 --> [A, D, T, H, U]
Of length 2 --> [AD, DT, TH, HU, UA, HT]
Of length 3 --> [ADT, DTH, THU, HUA]
Of length 4 --> [ADTH, THUA]
No sequences of length 5 and further

Micromega · Accepted Answer

You can try aho-corasick algorithm with a wildcard and/or just with all substrings. Aho-corasick is basically a finite state machine it needs a dictionary but then it find multiple pattern in the search string very fast. You can build a finite state machine with a trie and a breadth-first search. Here is nice example with animation:http://blog.ivank.net/aho-corasick-algorithm-in-as3.html. So you need basically 2 steps: build the finite state machine and search the string.

Pattern in continuous sequence data

Tags:

algorithm

frequency

data-mining

sequence-analysis

Haris

1 Answers

Micromega

Recent Activity

Donate For Us

Pattern in continuous sequence data

Tags:

algorithm

frequency

data-mining

sequence-analysis

Haris

1 Answers

Micromega

Related questions

Recent Activity

Donate For Us