Let's say I have a set of keywords in an array: {"olympics", "sports tennis best", "tennis", "tennis rules"}.
I then have a list of strings (tweets, actually), up to 50 at a time, each at most 140 characters long.
I want to look at each string and see what keywords are present there. In the case where a keyword is composed of multiple words like "sports tennis best", the words don't have to be together in the string, but all of them have to show up.
I'm having trouble figuring out an algorithm that does this efficiently.
Do you guys have suggestions on a way to do this? Thanks!
Edit: To explain a bit better, each keyword has an ID associated with it, so {1:"olympics", 2:"sports tennis best", 3:"tennis", 4:"tennis rules"}
I want to go through the list of strings/tweets and see which keywords each one matches. The output should be something like "this tweet belongs with keyword #4". Multiple matches are possible, so anything that matches keyword 2 would also match keyword 3, since both contain "tennis".
When a keyword has multiple words, e.g. "sports tennis best", the words don't have to appear together, but they all have to appear. For example, this will correctly match: "i just played tennis, i love sports, its the best". Since this string contains "sports", "tennis", and "best", it matches and is associated with the keyword ID (which is 2 for this example).
Edit 2: Case insensitive.
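using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<string> tweets, keywords; // assumed to be populated elsewhere
var x = tweets.Select(t => new
{
    Tweet = t,
    // A keyword matches when every one of its words occurs in the tweet,
    // compared case-insensitively as the question requires.
    Keywords = keywords.Where(k => k.Split(' ')
                           .All(w => t.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0))
                       .ToArray()
});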
Multiple patterns can be searched for very efficiently using algorithms such as Aho-Corasick (which uses a trie) or the one by Wu and Manber.
If performance is critical, I suggest using either of those. To search in multiple strings, it may be most efficient to concatenate all 50 strings into one larger string, keeping track of the starting positions of the individual strings.
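As a rough illustration of the Aho-Corasick idea, here is a minimal sketch. It builds one automaton over the distinct words of all keywords, scans each tweet once, and then reports the IDs of keywords whose words were all found. The class name `KeywordMatcher` and its members are hypothetical, invented for this example. It assumes keywords are split on single spaces and that matching is by substring (so "tennis" would also be found inside a longer word), lowercasing both sides for case insensitivity. It does not implement the concatenation trick mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: a small Aho-Corasick automaton over keyword words.
public sealed class KeywordMatcher
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;                                     // failure link
        public readonly List<int> WordIds = new List<int>();  // words ending at this state
    }

    private readonly Node root = new Node();
    // keyword ID -> indices of the words that make up that keyword
    private readonly Dictionary<int, int[]> keywordWords = new Dictionary<int, int[]>();

    public KeywordMatcher(IDictionary<int, string> keywords)
    {
        var wordIds = new Dictionary<string, int>();
        foreach (var kv in keywords)
        {
            var ids = new List<int>();
            foreach (var w in kv.Value.ToLowerInvariant()
                         .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!wordIds.TryGetValue(w, out int id))
                {
                    id = wordIds.Count;   // first time this word is seen
                    wordIds[w] = id;
                    Insert(w, id);
                }
                ids.Add(id);
            }
            keywordWords[kv.Key] = ids.ToArray();
        }
        BuildFailureLinks();
    }

    private void Insert(string word, int id)
    {
        var node = root;
        foreach (char c in word)
        {
            if (!node.Next.TryGetValue(c, out Node child))
                node.Next[c] = child = new Node();
            node = child;
        }
        node.WordIds.Add(id);
    }

    // Breadth-first pass that sets failure links and merges outputs along them.
    private void BuildFailureLinks()
    {
        var queue = new Queue<Node>();
        foreach (var child in root.Next.Values) { child.Fail = root; queue.Enqueue(child); }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var kv in node.Next)
            {
                var f = node.Fail;
                while (f != null && !f.Next.ContainsKey(kv.Key)) f = f.Fail;
                kv.Value.Fail = f == null ? root : f.Next[kv.Key];
                kv.Value.WordIds.AddRange(kv.Value.Fail.WordIds);
                queue.Enqueue(kv.Value);
            }
        }
    }

    // Returns the sorted IDs of all keywords whose words all occur in the text.
    public List<int> Match(string text)
    {
        var found = new HashSet<int>();
        var node = root;
        foreach (char c in text.ToLowerInvariant())
        {
            while (node != root && !node.Next.ContainsKey(c)) node = node.Fail;
            if (node.Next.TryGetValue(c, out Node next)) node = next;
            found.UnionWith(node.WordIds);
        }
        return keywordWords.Where(kv => kv.Value.All(found.Contains))
                           .Select(kv => kv.Key).OrderBy(id => id).ToList();
    }
}
```

The automaton is built once and reused for every tweet, so each tweet costs a single pass over its characters plus a check per keyword. With the question's four keywords, calling `Match` on "i just played tennis, i love sports, its the best" should return the IDs 2 and 3.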
Maybe something like this?
string[] keywords = new string[] {"olympics", "sports tennis best", "tennis", "tennis rules"};
string testString = "I like sports and the olympics and think tennis is best.";
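// Case-insensitive: keep a keyword only if each of its words occurs in the string.
string[] usedKeywords = keywords.Where(keyword => keyword.Split(' ')
    .All(s => testString.IndexOf(s, StringComparison.OrdinalIgnoreCase) >= 0)).ToArray();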
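To get back the keyword IDs described in the question's edit rather than the keyword strings, the same check can be run over an ID-to-keyword dictionary. This is only a sketch: the `KeywordIds` class and its `Match` method are made-up names, and it assumes words are separated by single spaces and matched as case-insensitive substrings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordIds
{
    // Returns the sorted IDs of every keyword whose words all occur in the tweet.
    public static List<int> Match(IDictionary<int, string> keywords, string tweet)
    {
        return keywords
            .Where(kv => kv.Value
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                // Ordinal, case-insensitive substring test for each word.
                .All(w => tweet.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0))
            .Select(kv => kv.Key)
            .OrderBy(id => id)
            .ToList();
    }
}
```

For the test string above and the IDs from the question, this should give 1, 2, and 3. With only a handful of keywords and 50 short tweets, a straightforward scan like this may well be fast enough.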