Duplicate substring searching

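Q: How do you find the longest repeated substring?

Let LCSRe(i, j) be the length of the longest common suffix of the prefixes ending at positions i and j, restricted so that the two occurrences do not overlap. The maximum value of LCSRe(i, j) is the length of the longest repeated substring, and the substring itself can be recovered from that length and the ending index of the common suffix.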
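
To make the LCSRe table concrete, here is a minimal Python sketch of the dynamic-programming idea, assuming LCSRe(i, j) means the length of the longest common suffix of the prefixes ending at positions i and j, capped so the two copies do not overlap. The function name is hypothetical. Note that it finds the longest repeat anywhere in the string, so the two copies need not be adjacent as the question below requires.

```python
def longest_repeated_non_overlapping(s: str) -> str:
    """Return the longest substring that occurs at least twice without overlap."""
    n = len(s)
    # lcsre[i][j] (i < j): length of the longest common suffix of s[:i] and s[:j],
    # capped at j - i so the two occurrences cannot overlap.
    lcsre = [[0] * (n + 1) for _ in range(n + 1)]
    best_len = 0
    best_end = 0  # exclusive end index of the first occurrence

    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            # Extend the common suffix only if the characters match and the
            # copy ending at i would still finish before the copy ending at j starts.
            if s[i - 1] == s[j - 1] and lcsre[i - 1][j - 1] < j - i:
                lcsre[i][j] = lcsre[i - 1][j - 1] + 1
                if lcsre[i][j] > best_len:
                    best_len = lcsre[i][j]
                    best_end = i
    return s[best_end - best_len:best_end]


print(longest_repeated_non_overlapping("ABCDDEFGHFGH"))  # FGH
print(longest_repeated_non_overlapping("AAAA"))          # AA
```

The table costs O(n²) time and memory, so this is only practical for short to medium strings.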

Tags:

substring

algorithm

search

Is there an efficient way to find duplicate substrings? Here, "duplicate" means that two identical substrings appear directly next to each other without overlapping. For example, given the source string:

ABCDDEFGHFGH

'D' and 'FGH' are duplicated. 'F' appears twice in the sequence, but the two occurrences are not next to each other, so it does not count as a duplicate. The algorithm should therefore return ['D', 'FGH']. Is there an elegant algorithm for this, as opposed to the brute-force method?

982

asked Dec 22 '16 11:12

maple

2 Answers

This is related to the longest repeated substring problem, which can be solved by building a suffix tree for the string in Θ(n) time and space.
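
As a rough illustration of the suffix-based idea, here is a Python sketch that uses a naively sorted suffix array in place of a suffix tree. It is not linear-time (sorting whole suffixes is far slower than Θ(n) in the worst case), and the function name is hypothetical. It also solves the classic longest repeated substring problem, where the occurrences may overlap and need not be adjacent, so its output would still have to be filtered to match the question's definition.

```python
def longest_repeated_substring(s: str) -> str:
    """Longest substring occurring at least twice (occurrences may overlap)."""
    # Naive suffix array: suffix start positions sorted by the suffix text.
    suffixes = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    # Any repeated substring is a common prefix of two suffixes, and the
    # longest one shows up between lexicographically adjacent suffixes.
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        if k > len(best):
            best = s[a:a + k]
    return best


print(longest_repeated_substring("ABCDDEFGHFGH"))  # FGH
print(longest_repeated_substring("banana"))        # ana
```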

197

answered Sep 20 '22 13:09

Muhammad Faizan Uddin

Not the most efficient approach (a suffix tree or suffix array is better for very large strings), but here is a very short regular-expression solution (C#):

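  using System;
  using System.Linq;
  using System.Text.RegularExpressions;

  string source = @"ABCDDEFGHFGH";

  // Each match is some text immediately followed by a copy of itself
  string[] result = Regex
    .Matches(source, @"(.+)\1")
    .OfType<Match>()
    .Select(match => match.Groups[1].Value)  // keep only the repeated unit
    .ToArray();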

Explanation

(.+) - a capturing group of one or more characters
\1   - the text captured by group #1, repeated immediately afterwards

Test

  Console.Write(string.Join(", ", result));

Outcome

  D, FGH

In case of ambiguity, e.g. "AAAA", where both "AA" and "A" are valid answers, the pattern matches greedily, so "AA" is returned.
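
For comparison, the same pattern can be tried with Python's `re` module, which also supports backreferences. The sketch below is only an illustration: the helper name and the `lazy` flag are assumptions, not part of the answer above. The flag switches to the non-greedy quantifier `+?` to show how the ambiguous "AAAA" case changes.

```python
import re


def adjacent_repeats(source, lazy=False):
    """Return the repeated unit of every back-to-back repeat, scanning left to right."""
    # (.+) captures one or more characters; \1 requires the same text again.
    # With lazy=True the quantifier becomes +?, which prefers the shortest unit.
    pattern = r"(.+?)\1" if lazy else r"(.+)\1"
    return [m.group(1) for m in re.finditer(pattern, source)]


print(adjacent_repeats("ABCDDEFGHFGH"))     # ['D', 'FGH']
print(adjacent_repeats("AAAA"))             # ['AA']
print(adjacent_repeats("AAAA", lazy=True))  # ['A', 'A']
```

Matches are non-overlapping and found in a single left-to-right scan, so a repeat that starts inside an earlier match is not reported.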

answered Sep 18 '22 13:09

Dmitry Bychenko

