Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find duplicate phrases in a large string

I am trying to figure out an efficient way to find duplicate phrases in a large string. The string will contain hundreds or thousands of words separated by an empty space. I've included code below that I am currently using but it is very inefficient in finding duplicate phrases.

    public static string FindDuplicateSubstringFast(string s, string keyword, bool allowOverlap = true)
{
    int matchPos = 0, maxLength = 0;
    if (s.ToLower().Contains(keyword.ToLower()))
        for (int shift = 1; shift < s.Length; shift++)
        {
            int matchCount = 0;
            for (int i = 0; i < s.Length - shift; i++)
            {

                if (s[i] == s[i + shift])
                {
                    matchCount++;
                    if (matchCount > maxLength)
                    {
                        maxLength = matchCount;
                        matchPos = i - matchCount + 1;
                    }
                    if (!allowOverlap && (matchCount == shift))
                    {
                        // we have found the largest allowable match 
                        // for this shift.
                        break;
                    }
                }
                else matchCount = 0;
            }
        }
    string newbs = s.Substring(matchPos, maxLength);
    if (maxLength > 3) return s.Substring(matchPos, maxLength);
    else return null;
}

I found the example code above @ Find duplicate content in string?

This method is going through every char and I would like to find a way to loop through each word. I'm not sure what would be the best way to do this. I was thinking I could split the string on the empty spaces and then put the words into a list. Iterating through a list should be way more efficient than iterating over every char like I am doing now. However, I don't know how I would iterate through the list and find duplicate phrases.

If anyone could help me figure out an algorithm to iterate through a list to find duplicate phrases, I would be very grateful. I would also be open to any other ideas or methods to find duplicate phrases within a large string.

Please let me know if any more info is needed.

EDIT: Here is an example of a large string {its small for this example}

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.

For example sake "Lorem Ipsum" would be the duplicate phrase. I need to return "Lorem Ipsum" and any other duplicate phrases that appear in the string more than once.

like image 385
Keefe Higgins Avatar asked Dec 03 '25 21:12

Keefe Higgins


1 Answers

string[] split = BigString.Split(' ').ToLower();
var duplicates = new Dictionary<string, int>();
for (int i = 0;i<split.Length;i++)
{
    int j=i;
    string s = split[i] + " ";
    while(i+j<split.Length)
    {
        j++;
        s += split[j] + " ";
        if (Regex.Matches(BigString.ToLower(), s).Count ==1) break;
        duplicates[s] = Regex.Matches(BigString.ToLower(), s).Count;
    }
}

Now, the dictionary will contain all the phrases and "subphrases" e.g. "Lorem Ipsum Dolor" will find "Lorem Ipsum" and "Lorem Ipsum Dolor". If that's not interesting to you, it's just a matter of looping through the Keys Collection of duplicates. If one key is a substring of another key, and their value is the same, remove said key.

like image 164
jose Avatar answered Dec 05 '25 12:12

jose



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!