Extract keywords from text in .NET

I need to count how many times each keyword occurs in a string and sort the results by count, highest first. What's the fastest way to do this in .NET?

SharpAffair asked Dec 13 '22 18:12


2 Answers

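EDIT: the code below groups unique tokens and counts each one, highest count first:

string[] target = src.Split(new char[] { ' ' });

// Group identical tokens and count each group once,
// rather than rescanning the whole array for every token
var results = target.GroupBy(t => t)
    .Select(g => new { str = g.Key, count = g.Count() })
    .OrderByDescending(r => r.count);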

This is finally starting to make more sense to me...

EDIT: the code below pairs each target substring with its count:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

// For each target, count the positions in src where the remaining text starts with it
var results = target.Select(t => new {str = t,
    count = src.Select((c, i) => src.Substring(i)).
    Count(sub => sub.StartsWith(t))});

The results are now:

[0] { str = "string", count = 4 }
[1] { str = "the", count = 4 }
[2] { str = "in", count = 6 }
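
Since the question asks about speed, here is a minimal sketch of the same count done with String.IndexOf instead of building a new substring at every position. It is not part of the original answer: the class and method names (SubstringCounter, CountOccurrences, CountAll) are made up for illustration, and it assumes ordinal (culture-insensitive) matching and that overlapping matches should be counted, as the Substring/StartsWith query does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, named for illustration only.
public static class SubstringCounter
{
    // Counts occurrences of 'target' in 'src', including overlapping ones.
    // Assumption: an empty target counts as zero matches.
    public static int CountOccurrences(string src, string target)
    {
        if (string.IsNullOrEmpty(target)) return 0;

        int count = 0;
        int pos = src.IndexOf(target, 0, StringComparison.Ordinal);
        while (pos >= 0)
        {
            count++;
            // Advance by one character so overlapping matches are still found.
            pos = src.IndexOf(target, pos + 1, StringComparison.Ordinal);
        }
        return count;
    }

    // Pairs each target with its count, highest count first.
    public static List<KeyValuePair<string, int>> CountAll(string src, IEnumerable<string> targets)
    {
        return targets
            .Select(t => new KeyValuePair<string, int>(t, CountOccurrences(src, t)))
            .OrderByDescending(p => p.Value)
            .ToList();
    }
}
```

Calling SubstringCounter.CountAll(src, target) with the src and target values above should give the same counts as the LINQ query, already sorted, without allocating a substring per character.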

Original code below:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

var results = target.Select(t => src.Select((c, i) => src.Substring(i)).
    Count(sub => sub.StartsWith(t))).OrderByDescending(t => t);

With grateful acknowledgement to this previous response.

Results from debugger (which need extra logic to include the matching string with its count):

-       results {System.Linq.OrderedEnumerable<int,int>}    
-       Results View    Expanding the Results View will enumerate the IEnumerable   
        [0] 6   int
        [1] 4   int
        [2] 4   int

Steve Townsend answered Jan 04 '23 23:01


Dunno about fastest, but LINQ is probably the most understandable:

var myListOfKeywords = new [] {"struct", "public", ...};

// Split on the delimiters of interest, dropping empty tokens
var keywordCount = from keyword in myProgramText.Split(new [] {" ", "(", ...}, StringSplitOptions.RemoveEmptyEntries)
   group keyword by keyword into g
   where myListOfKeywords.Contains(g.Key)
   orderby g.Count() descending
   select new {g.Key, Count = g.Count()};

foreach(var element in keywordCount)
   Console.WriteLine("Keyword: {0}, Count: {1}", element.Key, element.Count);

You can write this in a non-Linq-y way, but the basic premise is the same; split the string up into words, and count the occurrences of each word of interest.
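
For what the non-LINQ version might look like, here is a rough sketch that makes one pass over the words with a Dictionary. It is only an illustration of the split-and-count idea, not a benchmarked "fastest" answer: the KeywordCounter class, the Count method and its parameters are hypothetical names, and it assumes case-sensitive, whole-word matching.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, named for illustration only.
public static class KeywordCounter
{
    public static List<KeyValuePair<string, int>> Count(
        string text, IEnumerable<string> keywords, char[] separators)
    {
        // Pre-seed the dictionary so only the words of interest are counted.
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (var keyword in keywords)
            counts[keyword] = 0;

        // Single pass over the words; anything not pre-seeded is ignored.
        foreach (var word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            int current;
            if (counts.TryGetValue(word, out current))
                counts[word] = current + 1;
        }

        // Drop keywords that never occurred and sort by count, highest first.
        return counts
            .Where(p => p.Value > 0)
            .OrderByDescending(p => p.Value)
            .ToList();
    }
}
```

You'd call it with something like KeywordCounter.Count(myProgramText, myListOfKeywords, new[] {' ', '('}) and loop over the returned pairs, reading Key and Value.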

KeithS answered Jan 04 '23 22:01