I have a string s and I want to search for the substring of length X that occurs most often in s. Overlapping substrings are allowed. For example, if s="aoaoa" and X=3, the algorithm should find "aoa" (which appears 2 times in s). Does an algorithm exist that does this in O(n) time?


Most common substring of length X

3 Answers

You can do this using a rolling hash in O(n) time (assuming good hash distribution). A simple rolling hash would be the xor of the characters in the substring; you can compute it incrementally from the previous substring's hash using just 2 xors. (Xor ignores character order, so see the Wikipedia entry for better rolling hashes.) Compute the hashes of your n-X+1 substrings using the rolling hash in O(n) time. If there are no collisions, the answer is clear: the hash with the highest count identifies the most common substring. If collisions happen, you'll need to do more work. My brain hurts trying to figure out if that can all be resolved in O(n) time.
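
As a rough illustration of the counting pass described above, here is a minimal Python sketch. It swaps the xor hash for a polynomial rolling hash; the function name `rolling_hash_counts` and the `base` and `mod` defaults are assumptions chosen for the example, not part of the answer. It tallies windows by hash only, so colliding substrings would share a count.

```python
def rolling_hash_counts(s, X, base=257, mod=(1 << 61) - 1):
    """Map each window hash to [count, first_start] over all length-X windows of s.

    `base` and `mod` are illustrative defaults, not values from the answer.
    """
    n = len(s)
    if X <= 0 or X > n:
        return {}
    # base^(X-1) mod `mod`, used to remove the outgoing character's contribution.
    top = pow(base, X - 1, mod)
    h = 0
    for ch in s[:X]:
        h = (h * base + ord(ch)) % mod
    table = {h: [1, 0]}
    for i in range(1, n - X + 1):
        # Drop s[i-1], shift by one position, append s[i+X-1]: O(1) per window.
        h = ((h - ord(s[i - 1]) * top) * base + ord(s[i + X - 1])) % mod
        entry = table.setdefault(h, [0, i])
        entry[0] += 1
    return table
```

For s="aoaoa" and X=3 the table has two entries: one with count 2 starting at position 0 ("aoa") and one with count 1 starting at position 1 ("oao").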

Update:

Here's a randomized O(n) algorithm. You can find the top hash in O(n) time by scanning the hashtable (keeping it simple, assume no ties). Find one X-length string with that hash (keep a record in the hashtable, or just redo the rolling hash). Then use an O(n) string searching algorithm to find all occurrences of that string in s. If you find the same number of occurrences as you recorded in the hashtable, you're done.

If not, that means you have a hash collision. Pick a new random hash function and try again. If your hash function has log(n)+1 bits and is pairwise independent [Prob(h(s) == h(t)) < 1/2^{log(n)+1} = 1/(2n) if s != t], then the probability that the most frequent X-length substring in s has a collision with any of the <=n other length-X substrings of s is at most 1/2. So if there is a collision, pick a new random hash function and retry; in expectation you will need only a constant number of tries before you succeed.
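
The 1/2 figure is a union bound. Spelled out, writing t* for the most frequent substring (a symbol introduced here for illustration) and assuming each pair of distinct substrings collides with probability below 1/(2n):

```latex
\Pr\bigl[\exists\, t \ne t^{*} : h(t) = h(t^{*})\bigr]
  \le \sum_{t \ne t^{*}} \Pr\bigl[h(t) = h(t^{*})\bigr]
  < n \cdot \frac{1}{2^{\log_2 n + 1}}
  = n \cdot \frac{1}{2n}
  = \frac{1}{2}
```

Each try then succeeds with probability at least 1/2, so the expected number of tries is at most 2.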

Now we only need a randomized pairwise independent rolling hash algorithm.

Update2:

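Actually, you need 2log(n) bits of hash to avoid all (n choose 2) collisions, because any collision may hide the right answer: two less frequent substrings that share a hash can together outnumber the true winner. With a per-pair collision probability of 1/n^2, a union bound over the fewer than n^2/2 pairs keeps the failure probability below 1/2. Still doable, and it looks like hashing by general polynomial division should do the trick.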
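
The verify-and-retry scheme from the updates above can be sketched in Python as follows. This is a sketch under stated assumptions, not a proven implementation of the answer: a polynomial hash with a random base modulo the Mersenne prime 2^61-1 stands in for the pairwise independent rolling hash, `str.find` stands in for a linear-time string search such as KMP, and the names `most_common_substring`, `count_overlapping`, and `max_tries` are made up for the example.

```python
import random

MOD = (1 << 61) - 1  # assumed modulus: 61 bits, well above 2*log2(n) for practical n


def count_overlapping(s, pattern):
    """Count occurrences of pattern in s, allowing overlaps."""
    count, pos = 0, s.find(pattern)
    while pos != -1:
        count += 1
        pos = s.find(pattern, pos + 1)
    return count


def most_common_substring(s, X, max_tries=20):
    """Return (substring, count) for a most frequent length-X substring of s."""
    n = len(s)
    if X <= 0 or X > n:
        raise ValueError("X must satisfy 1 <= X <= len(s)")
    for _ in range(max_tries):
        base = random.randrange(256, MOD)  # fresh random hash function per try
        top = pow(base, X - 1, MOD)
        h = 0
        for ch in s[:X]:
            h = (h * base + ord(ch)) % MOD
        table = {h: [1, 0]}  # hash -> [count, first start position]
        for i in range(1, n - X + 1):
            h = ((h - ord(s[i - 1]) * top) * base + ord(s[i + X - 1])) % MOD
            table.setdefault(h, [0, i])[0] += 1
        # Scan the table for the top hash and recover one string with that hash.
        count, start = max(table.values(), key=lambda e: e[0])
        candidate = s[start:start + X]
        # Verify: a mismatch means a collision inflated the top bucket, so retry.
        if count_overlapping(s, candidate) == count:
            return candidate, count
    raise RuntimeError("no collision-free hash found; increase max_tries")
```

A successful return is always exact: bucket counts can only overcount, so when the top bucket's count matches the candidate's true count, no other substring can occur more often. Calling `most_common_substring("aoaoa", 3)` returns `("aoa", 2)`.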

answered Sep 22 '22 13:09

Keith Randall

I don't see an easy way to do this in strictly O(n) time, unless X is fixed and can be considered a constant. If X is a parameter to the algorithm, then most simple ways of doing this will actually be O(n*X), as you will need to do comparison operations, string copies, hashes, etc., on a substring of length X at every iteration.

(I'm imagining, for a minute, that s is a multi-gigabyte string and that X is some number over a million, and I'm not seeing any simple way of comparing or hashing substrings of length X that is O(1) rather than dependent on the size of X.)

It might be possible to avoid string copies during scanning, by leaving everything in place, and to avoid re-hashing the entire substring -- perhaps by using an incremental hash algorithm where you can add a byte at a time, and remove the oldest byte -- but I don't know of any such algorithms that wouldn't result in huge numbers of collisions that would need to be filtered out with an expensive post-processing step.

Update

Keith Randall points out that this kind of hash is known as a rolling hash. The problem remains, though, that you would have to store the starting string position for each match in your hash table, and then verify after scanning the string that all of your matches were true. You would need to sort the hashtable, which could contain up to n-X+1 entries, based on the number of matches found for each hash key, and verify each result -- probably not doable in O(n).

answered Sep 19 '22 13:09

Ian Clelland

It should be O(n*m), where m is the average length of a string in the list. For very small values of m, the algorithm will approach O(n).

Build a hashtable of counts for each string length.
Iterate over your collection of strings, updating the hashtable accordingly and storing the count of the current most prevalent string as an integer variable separate from the hashtable.
Done.
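
A minimal Python sketch of these steps, under one reading of the answer: the "strings" are taken to be the length-X windows of the question's single string s, so m equals X. The function name `most_common_by_table` is an assumption for the example.

```python
def most_common_by_table(s, X):
    """Count every length-X window of s in a dict, tracking the leader as we go."""
    if X <= 0:
        raise ValueError("X must be positive")
    counts = {}
    best, best_count = None, 0  # running leader, kept outside the hashtable
    for i in range(len(s) - X + 1):
        window = s[i:i + X]  # O(X) copy plus O(X) hashing: O(n*X) overall
        counts[window] = counts.get(window, 0) + 1
        if counts[window] > best_count:
            best, best_count = window, counts[window]
    return best, best_count
```

Because every window is copied and hashed in full, this runs in O(n*X) rather than O(n); it returns `(None, 0)` when X exceeds the length of s.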

answered Sep 21 '22 13:09

Chris Ballance
