Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What techniques/tools are there for discovering common phrases in chunks of text?

Lets say I have 100000 email bodies and 2000 of them contains an abitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet". What techniques could/should I use to "mine" these phrases? I'm not interested in mining single words or short phrases. Also I need to filter out phrases that I already know occur in all mails.

Example:

string mailbody1 = "Welcome to the world of tomorrow! This is the first mail body. Lorem ipsum dolor sit AMET. Have a nice day dude. Cya!";
string mailbody2 = "Welcome to the world of yesterday! Lorem ipsum dolor sit amet Please note this is the body of the second mail. Have a nice day.";
string mailbody3 = "A completely different body.";
string[] mailbodies = new[] {mailbody1, mailbody2, mailbody3};
string[] ignoredPhrases = new[] {"Welcome to the world of"};

string[] results = DiscoverPhrases(mailbodies, ignoredPhrases);

In this example I want the DiscoverPhrases function to return "lorem ipsum dolor sit amet" and "have a nice day". It's not that important if the function also returns shorter "noise" phrases, but if its possible it would be nice to eliminate these in the process.

Edit: I forgot to include mailbody3 in the example.

like image 901
JohannesH Avatar asked Sep 15 '09 10:09

JohannesH


People also ask

How do you search for keywords in a text?

You can use a keyword extractor to pull out single words (keywords) or groups of two or more words that create a phrase (key phrases). Try the keyword extractor, below, using your own text to pull out single words (keywords) or groups of two or more words that create a phrase (key phrases).


1 Answers

Have a look at N-grams. The most common phrases will necessarily contribute the most common N-grams. I'd start out with word trigrams and see where that leads. (Space required is N times the length of the text, so you can't let N get too big.) If you save the positions and not just a count, you can then see if the trigrams can be extended to form common phrases.

like image 196
Norman Ramsey Avatar answered Sep 29 '22 13:09

Norman Ramsey