Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

find number of repeating substrings in a string

I am looking for an algorithm that will find the number of repeating substrings in a single string.

For this, I was looking for some dynamic programming algorithms but didn't find any that would help me. I just want some tutorial on how to do this.

Let's say I have a string ABCDABCDABCD. The expected output for this would be 3, because there is ABCD 3 times.

For input AAAA, output would be 4, since A is repeated 4 times.

For input ASDF, output would be 1, since every individual character is repeated 1 time only.

I hope that someone can point me in the right direction. Thank you.

like image 722
mereth Avatar asked Mar 31 '17 19:03

mereth


People also ask

How do you count repeated substrings in Python?

Python has a built-in function for counting the repeated substring in a given string called count(). As the name suggests, it counts the occurrence of a substring in a given string.

How many substrings exist of a string?

If the length of string is N, then there will be N*(N+1)/2 possible substrings.

How many substrings are in a string Python?

Python string count() is an inbuilt function that returns the number of occurrences of the substring in the given string. The count() method searches the substring in the given string and returns how many times the substring is present in it.


Video Answer


2 Answers

I am taking the following assumptions:

  • The repeating substrings must be consecutive. That is, in case of ABCDABC, ABC would not count as a repeating substring, but it would in case of ABCABC.
  • The repeating substrings must be non-overalpping. That is, in case of ABCABC, ABC would not count as a repeating substring.
  • In case of multiple possible answers, we want the one with the maximum value. That is, in the case of AAAA, the answer should be 4 (a is the substring) rather than 2 (aa is the substring).

Under these assumptions, the algorithm is as follows:

  • Let the input string be denoted as inputString.
  • Calculate the KMP failure function array for the input string. Let this array be denoted as failure[]. This operation if of linear time complexity with respect to the length of the string. So, by definition, failure[i] denotes the length of the longest proper-prefix of the substring inputString[0....i] that is also a proper-suffix of the same substring.
  • Let len = inputString.length - failure.lastIndexValue. At this point, we know that if there is any repeating string at all, then it has to be of this length len. But we'll need to check for that; First, just check if len perfectly divides inputString.length (that is, inputString.length % len == 0). If yes, then check if every consecutive (non-overlapping) substring of len characters is the same or not; this operation is again of linear time complexity with respect to the length of the input string.
  • If it turns out that every consecutive non-overlapping substring is the same, then the answer would be = inputString.length/ len. Otherwise, the answer is simply inputString.length, as there is no such repeating substring present.

The overall time complexity would be O(n), where n is the number of characters in the input string.

A sample code for calculating the KMP failure array is given here.


For example,

Let the input string be abcaabcaabca.

Its KMP failure array would be - [0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8].

So, our len = (12 - 8) = 4.

And every consecutive non-overlapping substring of length 4 is the same (abca).
Therefore the answer is 12/4 = 3. That is, abca is repeated 3 times repeatedly.

like image 112
Anmol Singh Jaggi Avatar answered Nov 07 '22 06:11

Anmol Singh Jaggi


The solution for this with C# is:

 class Program
 {
     public static string CountOfRepeatedSubstring(string str)
     {
         if (str.Length < 2)
         {
             return "-1";
         }

         StringBuilder substr = new StringBuilder();

         // Length of the substring cannot be greater than half of the actual string
         for (int i = 0; i < str.Length / 2; i++)
         {
             // We will iterate through half of the actual string and
             // create a new string by appending the current character to the previous character
             substr.Append(str[i]);

             String clearedOfNewSubstrings = str.Replace(substr.ToString(), "");

             // We will remove the newly created substring from the actual string and 
             // check if the length of the actual string, cleared of the newly created substring, is 0.
             // If 0 it tells us that it is only made of its substring
             if (clearedOfNewSubstrings.Length == 0)
             {
                 // Next we will return the count of the newly created substring in the actual string.
                 var countOccurences = Regex.Matches(str, substr.ToString()).Count;

                 return countOccurences.ToString();
             }
         }

         return "-1";
     }       

     static void Main(string[] args)
     {
         // Input: {"abcdaabcdaabcda"}
         // Output: 3

         // Input: { "abcdaabcdaabcda" }
         // Output: -1

         // Input: {"barrybarrybarry"}
         // Output: 3            

         var s = "asdf"; // Output will be -1

         Console.WriteLine(CountOfRepeatedSubstring(s));
     }
 }
like image 43
j4jada Avatar answered Nov 07 '22 07:11

j4jada