Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to see if keywords exist inside a string

Let's say I have a set of keywords in an array {"olympics", "sports tennis best", "tennis", "tennis rules"}

I then have a large list (up to 50 at a time) of strings (or actually tweets), so they are a max of 140 characters.

I want to look at each string and see what keywords are present there. In the case where a keyword is composed of multiple words like "sports tennis best", the words don't have to be together in the string, but all of them have to show up.

I've having trouble figuring out an algorithm that does this efficiently.

Do you guys have suggestions on a way to do this? Thanks!

Edit: To explain a bit better each keyword has an id associated with it, so {1:"olympics", 2:"sports tennis best", 3:"tennis", 4:"tennis rules"}

I want to go through the list of strings/tweets and see which group of keywords match. The output should be, this tweet belongs with keyword #4. (multiple matches may be made, so anything that matches keyword 2, would also match 3 -since they both contain tennis).

When there are multiple words in the keyword, e.g. "sports tennis best" they don't have to appear together but have to all appear. e.g. this will correctly match: "i just played tennis, i love sports, its the best"... since this string contains "sports tennis best" it will match and be associated with the keywordID (which is 2 for this example).

Edit 2: Case insensitive.

like image 976
rksprst Avatar asked Apr 22 '10 16:04

rksprst


3 Answers

IEnumerable<string> tweets, keywords;

var x = tweets.Select(t => new
                           {
                               Tweet = t,
                               Keywords = keywords.Where(k => k.Split(' ')
                                                               .All(t.Contains))
                                                  .ToArray()
                           });
like image 179
dtb Avatar answered Nov 09 '22 10:11

dtb


Multiple patterns can be searched very efficiently using several algorithms such as the algorithm of Aho-Corasick (using a trie) or the one from Wu and Manber.

If performance is critical, I suggest taking either of those. To search in multiple strings, it may be most efficient to concatenate all your 50 strings into one larger string, keeping book of the starting positions of individual strings.

like image 33
Konrad Rudolph Avatar answered Nov 09 '22 10:11

Konrad Rudolph


Maybe something like this?

        string[] keywords = new string[] {"olympics", "sports tennis best", "tennis", "tennis rules"};

        string testString = "I like sports and the olympics and think tennis is best.";

        string[] usedKeywords = keywords.Where(keyword => keyword.Split(' ').All(s => testString.Contains(s))).ToArray();
like image 27
Robin Day Avatar answered Nov 09 '22 10:11

Robin Day