I am thinking about writing a program to collect for me the most common phrases in a large volume of the text. Had the problem been reduced to just finding words than that would be as simple as storing each new word in a hashmap and then increasing the count on each occurrence. But with phrases, storing each permutation of a sentence as a key seems infeasible.
Basically the problem is narrowed down to figuring out how to extract every possible phrase from a large enough text. Counting the phrases and then sorting by the number of occurrences becomes trivial.
I assume that you are searching for common patterns of consecutive words appearing in the same order (e.g. "top of the world" would not be counted as the same phrase as "top of a world" or "the world of top").
If so then I would recommend the following linear-time approach:
You are now in a position to collect your common phrases.
It is not quite clear how you wish to determine the end of a phrase. One possibility is to simply collect all sequences of 4 words that repeat.
This can be done in linear time by working through your suffix array looking at places where the longest common prefix array is >= 4. Each run of indices x in the range [start+1...start+len] where the LCP[x] >= 4 (for all except the last value of x) corresponds to a phrase that is repeated len times. The phrase itself is given by the first 4 words of, for example, suffix start+1.
Note that this approach will potentially spot phrases that cross sentence ends. You may prefer to convert some punctuation such as full stops into unique integers to prevent this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With