Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Search for cyclic strings

I am looking for the most efficient way to store binary strings in a data structure (insert function) and then when getting a string I want to check if some cyclic string of the given string is in my structure.

I thought about storing the input strings in a Trie but then when trying to determine whether some cyclic string of the string I got now was inserted to the Trie means to do |s| searches in the Trie for all the possible cyclic strings.

Is there any way to do that more efficiently while the place complexity will be like in a Trie?

Note: When I say cyclic strings of a string I mean that for example all the cyclic strings of 1011 are: 0111, 1110, 1101, 1011

like image 324
user550413 Avatar asked Jan 20 '12 14:01

user550413


1 Answers

Can you come up with a canonicalizing function for cyclic strings based on the following:

  1. Find the largest run of zeroes.
  2. Rotate the string so that that run of zeroes is at the front.
  3. For each run of zeroes of equal size, see if rotating that to the front produces a lexicographically lesser string and if so use that.

This would canonicalize everything in the equivalence class (1011, 1101, 1110, 0111) to the lexicographically least value: 0111.

0101010101 is a thorny instance for which this algo will not perform well, but if your bits are roughly randomly distributed, it should work well in practice for long strings.

You can then hash based on the canonical form or use a trie that will include only the empty string and strings that start with 0 and a single trie run will answer your question.

EDIT:

if I have a string of a length |s| it can take a lot of time to find the least lexicographically value..how much time will it actually take?

That's why I said 010101.... is a value for which it performs badly. Let's say the string is of length n and the longest run of 1's is of length r. If the bits are randomly distributed, the length of the longest run is O(log n) according to "Distribution of longest run".

The time to find the longest run is O(n). You can implement shifting using an offset instead of a buffer copy, which should be O(1). The number of runs is worst case O(n / m).

Then, the time to do step 3 should be

  1. Find other long runs: one O(n) pass with O(log n) storage average case, O(n) worst case
  2. For each run: O(log n) average case, O(n) worst case
  3.   Shift and compare lexicographically: O(log n) average case since most comparisons of randomly chosen strings fail early, O(n) worst case.

This leads to a worst case of O(n²) but an average case of O(n + log² n) ≅ O(n).

like image 95
Mike Samuel Avatar answered Oct 05 '22 22:10

Mike Samuel