Symbolic representation of patterns in strings, and finding "similar" sub-patterns

Question

A string "abab" could be thought of as a pattern of indexed symbols "0101", and a string "bcbc" would also be represented by "0101". That's pretty nifty and makes for powerful comparisons, but it quickly falls apart outside of perfect cases.
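For concreteness, here is a minimal sketch of that indexed-symbol encoding. The function name `encode_pattern` is made up for illustration, and it returns a list of integers rather than a digit string so that texts with more than ten distinct characters stay unambiguous.

```python
def encode_pattern(text):
    """Number each character by the order in which it first appears."""
    ids = {}
    # setdefault returns the existing id, or assigns the next free one.
    return [ids.setdefault(ch, len(ids)) for ch in text]


abab = encode_pattern("abab")      # [0, 1, 0, 1]
bcbc = encode_pattern("bcbc")      # [0, 1, 0, 1]
babcbc = encode_pattern("babcbc")  # [0, 1, 0, 2, 0, 2]
```

Two strings then share a pattern exactly when their encodings are equal, which is the "perfect case" comparison.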

"babcbc" would be "010202". If I wanted to note that it contains a pattern equal to "0101" (the "bcbc" part), I can only think of doing some sort of normalization at each index n, "re-representing" the substring that starts at n symbolically before comparing. And that gets complicated if I'm trying to see whether "babcbc" and "dababd" (010202 vs 012120) have anything in common. So inefficient!
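The per-index normalization described above might look roughly like the sketch below. Both function names are hypothetical. Every window is re-encoded from scratch, so the cost is about len(text) * len(pattern) for a single pattern length, which is the inefficiency being complained about.

```python
def encode_pattern(text):
    """Number each character by the order in which it first appears."""
    ids = {}
    return [ids.setdefault(ch, len(ids)) for ch in text]


def find_pattern_naive(text, pattern):
    """Return every start index whose window re-encodes to `pattern`."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if encode_pattern(text[i:i + m]) == pattern
    ]


hits_1 = find_pattern_naive("babcbc", [0, 1, 0, 1])  # [2], the "bcbc" part
hits_2 = find_pattern_naive("dababd", [0, 1, 0, 1])  # [1], the "abab" part
```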

How could this be done efficiently, taking care of all possible nested cases? Note that I'm looking for similar patterns, not similar sub-strings in the actual text.

mcdowella · Accepted Answer

Try replacing each character with min(K, distance back to the previous occurrence of that character), where K is a tunable constant and a character with no previous occurrence gets K. Then babcbc and dababd become something like KK2K22 and KKK225. You could use a suffix tree or suffix array to find repeats in the transformed text.
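A rough sketch of this transform follows, assuming K = 9 and assuming that a first occurrence is encoded as K. To keep the example short it finds the longest shared run with plain quadratic dynamic programming instead of the suffix tree or suffix array suggested above; the function names are made up for illustration.

```python
def distance_transform(text, k):
    """Replace each character with min(k, distance back to its previous occurrence).

    A character with no previous occurrence is mapped to k (an assumption).
    """
    last_seen = {}
    out = []
    for i, ch in enumerate(text):
        if ch in last_seen:
            out.append(min(k, i - last_seen[ch]))
        else:
            out.append(k)
        last_seen[ch] = i
    return out


def longest_common_run(a, b):
    """Return (length, start_in_a, start_in_b) of the longest shared contiguous run.

    O(len(a) * len(b)) dynamic programming, standing in for a suffix structure.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


K = 9  # assumed cap; any value above the longest distance of interest works
t1 = distance_transform("babcbc", K)  # [9, 9, 2, 9, 2, 2]
t2 = distance_transform("dababd", K)  # [9, 9, 9, 2, 2, 5]
length, i, j = longest_common_run(t1, t2)  # (3, 0, 1): "bab" vs "aba"
```

One caveat: a distance near the start of a window can point back to a character outside that window, so two windows with the same pattern do not always get identical transformed values. For example, "bcbc" inside "babcbc" appears as 2K22 rather than KK22. Equal runs no longer than K do share a pattern, but some shared patterns will be missed unless out-of-window distances are handled separately.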

Tags: string, algorithm, pattern-matching, data-structures

user173342

1 Answer

mcdowella
