Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Symbolic representation of patterns in strings, and finding "similar" sub-patterns

A string "abab" could be thought of as a pattern of indexed symbols "0101". And a string "bcbc" would also be represented by "0101". That's pretty nifty and makes for powerful comparisons, but it quickly falls apart out of perfect cases.

"babcbc" would be "010202". If I wanted to note that it contains a pattern equal to "0101" (the bcbc part), I can only think of doing some sort of normalization process at each index to "re-represent" the substring from n to length symbolically for comparison. And that gets complicated if I'm trying to see if "babcbc" and "dababd" (010202 vs 012120) have anything in common. So inefficient!

How could this be done efficiently, taking care of all possible nested cases? Note that I'm looking for similar patterns, not similar sub-strings in the actual text.

like image 308
user173342 Avatar asked Nov 12 '22 22:11

user173342


1 Answers

Try replacing each character with min(K, distance back to previous occurrence of that character), where K is a tunable constant so babcbc and dababd become something like KK2K22 and KKK225. You could use a suffix tree or suffix array to find repeats in the transformed text.

like image 167
mcdowella Avatar answered Dec 09 '22 11:12

mcdowella