Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find recurring word groups in text with C#? [closed]

Tags:

c#

regex

text

I'm getting recurring word counts in StringBuilder(sb) with this code which i've found on internet and according to writer it's really consistent like Word's word counter.

StringBuilder wordBuffer = new StringBuilder();
        int wordCount = 0;
        // 1. Build the list of words used. Consider ''' (apostrophe) and '-' (hyphen) a word continuation character.
        Dictionary<string, int> wordList = new Dictionary<string, int>();
        foreach (char c in sb.ToString())
        {

            if (char.IsLetter(c) || c == '\'' || c == '-')
            {
                wordBuffer.Append(char.ToLower(c));
            }
            else
            {
                if (wordBuffer.Length > 3)
                {
                    int count = 0;
                    string word = wordBuffer.ToString();
                    wordList.TryGetValue(word, out count);
                    wordList[word] = ++count;

                    wordBuffer.Clear();
                    wordCount++;
                }
            }
        }

This is my sample text:

The green algae (singular: green alga) are a large, informal grouping of algae consisting of the Chlorophyte and Charophyte algae, which are now placed in separate Divisions. The land plants or Embryophytes (higher plants) are thought to have emerged from the Charophytes.[1] As the embryophytes are not algae, and are therefore excluded, green algae are a paraphyletic group. However, the clade that includes both green algae and embryophytes is monophyletic and is referred to as the clade Viridiplantae and as the kingdom Plantae. The green algae include unicellular and colonial flagellates, most with two flagella per cell, as well as various colonial, coccoid and filamentous forms, and macroscopic, multicellular seaweeds. In the Charales, the closest relatives of higher plants, full cellular differentiation of tissues occurs. There are about 8,000 species of green algae.[2] Many species live most of their lives as single cells, while other species form coenobia (colonies), long filaments, or highly differentiated macroscopic seaweeds. A few other organisms rely on green algae to conduct photosynthesis for them. The chloroplasts in euglenids and chlorarachniophytes were acquired from ingested green algae,[1] and in the latter retain a nucleomorph (vestigial nucleus). Green algae are also found symbiotically in the ciliate Paramecium, and in Hydra viridissima and in flatworms. Some species of green algae, particularly of genera Trebouxia of the class Trebouxiophyceae and Trentepohlia (class Ulvophyceae), can be found in symbiotic associations with fungi to form lichens. In general the fungal species that partner in lichens cannot live on their own, while the algal species is often found living in nature without the fungus. Trentepohlia is a filamentous green alga that can live independently on humid soil, rocks or tree bark or form the photosymbiont in lichens of the family Graphidaceae.

With my sample text, I'm getting green and algae words in the first lines as expected.

Problem is, I don't need only single words, I need word groups too. With this example text, I want green algae words too, together with green and algae words.

And my optional problem is: I need to do it with high performance, because texts can be very long. As i researched it's not high performance to use RegEx with this case, but I'm not sure about if there is a second way to make it possible.

Thanks in advance.

UPDATE If you got what I'm asking about, you don't need to read these lines.
As I see too many comments about my "group" definiton is not clear, I think I need to state my point with more detail and I wished write these lines on comments section but it's a little narrow area for this update. Firstly, I know StackOverflow is not a coding service. I'm trying to find the most used word groups in an article and trying to decide what's article about, we can call it tag generator too. For this purpose I tried to find most used words and it was okay at the beginning. Then i realized it's not a good way to decide about topic because I can't assume the article is about only first or second word. In my example I can't say this article is only about green or algae because they mean something together here, not alone. If i try this with an article about a three named celebrity like "Helena Bonham Carter" (if I assume it's written full name along article, not only surname), I want to take these words together not one by one. I'm trying to achieve more clever algorithm which is guessing the topic in most accurate way and with one shot. I don't want to limit the word count because article may be about "United Nations Industrial Development Organization" (again I assume it's now written like "UNIDO" in article). And I can achieve this by trying to get every word group starting from any index to the end of text with any length. Okay it's not a good way really, especially with long texts but it's not impossible right? But i was looking for a better way to do this and I just asked about a better algorithm idea and best tool to use, I can write the code by myself. I hope I stated my goal clear finally.

like image 245
ErTR Avatar asked Nov 24 '15 05:11

ErTR


3 Answers

The way to achieve this is to take the initial text, and split it by whitespace into a string array using string.split(' ');

Next, you need to iterate over every string in the array. This is easy for single words, but more complex for groups. For this reason, you need to define a group size. You must control the number of places in the array the pointer advances for each iteration.

Once you are able to iterate the array, you need to grab the group of words (however long you have chosen it to be), and store it somewhere. Your dictionary in the example is a good approach.

If the dictionary contains the word group, you increment its value by one. If it doesn't contain the group, just add it with a default value of 1.

 if (wordList.ContainsKey(theKey)) {
   wordList[theKey]++;
 } else {
   wordList.Add(theKey, 1);
 }

You do rightly mention that your research showed that regex is not high performance. For this task, regex is completely the wrong tool - you're not looking for patterns, you're examining groups. For that, you have to go through the text from start to finish, checking values.

Any task that involves iterating through text and running a repeating function on it should never use regex.

EDIT: My initial assumption of the performance of Regex was not correct - in C#, it seems to be a great deal faster than in Java, yet I would still maintain that a pure regex approach is not as quick as using regex to tokenise the text then use either a loop or linq expression to find the groups.

Stating

@galakt As I mentioned above, let's say 3. Does it matter?

The idea of a word group is entirely abstract. Yes, it's a group of words, but the whole block of text is a group of words. Rules have to be applied to govern how you act on that group of words.

Below is a sample method which will return a dictionary of all the word groups based on a size passed via the method call. It does not strip any non-alphanumeric chars from the text, but it is fast, even with larger group sizes.

To call it, use Dictionary<String, int> SingleWordGroups = GetWordGroupInstances(1);

    private Dictionary<String, int> GetWordGroupInstances(int GroupSize) {

        Dictionary<String, int> WordGroupInstances = new Dictionary<string, int>();

        //Grab the string to work from...
        String[] sourceText = GetSourceText().Split(' ');
        int pointer = 0;
        StringBuilder groupBuilder = new StringBuilder();
        while (pointer < sourceText.Length - GroupSize) {
            groupBuilder.Clear();
            int offset = pointer + GroupSize;
            for (int i = pointer; i < offset; i++) {
                //prepend a space to allow separation between words in groups. 
                //We can make a substring from this later starting from char 1 
                //to lose the initial whitespace from the string.
                groupBuilder.Append(" ").Append(sourceText[i]);
            }

            String key = groupBuilder.ToString().Substring(1);
            if (!WordGroupInstances.ContainsKey(key)) {
                WordGroupInstances.Add(key, 1);
            } else {
                WordGroupInstances[key]++;
            }

            /**
             * Setting the pointer to increase by group size grabs a group, moves on
             * to the end of the group, and grabs the next.
             * 
             */
            pointer += GroupSize;

            /**
             * Setting the point to increment by 1 grabs a group, advances by 1 word, then
             * grabs the next, so from the phrase - "Hello world, I'm some text", the groups of size 2 would be
             * "Hello world,", "world, I'm", "I'm some" etc...
             */
            //pointer++;
        }

        return WordGroupInstances;

    }

The method below is modified to produce all the group output in turn, so The The Green Green Algae The Green Algae etc...

It's worth noting that the entire text must be converted to either lower or upper case so that words are not case dependent. I've refined this a little to improve the performance (and remove the need for the break instruction).

   private Dictionary<String, int> GetAllGroups() {
        Dictionary<string, int> WordGroupInstances = new Dictionary<string, int>();
        StringBuilder groupBuilder = new StringBuilder();
        String[] sourceText = GetSourceText().Split(' ');

        for (int i = 0; i < sourceText.Length; i++) {
            groupBuilder.Clear();
            for (int j = i; j < sourceText.Length; j++) {
                groupBuilder.Append(" ").Append(sourceText[j]);
                String key = groupBuilder.ToString().Substring(1);
                if (!WordGroupInstances.ContainsKey(key)) {
                    WordGroupInstances.Add(key, 1);
                } else {
                    WordGroupInstances[key]++;
                }
            }
        }
        return WordGroupInstances;
    }

After performance testing with the corpus of text (288 words), it will create the 41773 word groups in 0.171886 seconds.

like image 136
Alex Avatar answered Nov 14 '22 23:11

Alex


I think that this works fairly well.

var text = @"The green algae (singular: green alga) are ..."; // include all your text

var remove = "().,:[]0123456789".Select(x => x.ToString()).ToArray();

var words =
    Regex
        .Matches(text, @"(\S+)")
        .Cast<Match>()
        .SelectMany(x => x.Captures.Cast<Capture>())
        .Select(x => remove.Aggregate(x.Value, (t, r) => t.Replace(r, "")))
        .Select(x => x.Trim().ToLowerInvariant())
        .Where(x => !String.IsNullOrWhiteSpace(x))
        .ToArray();

var groups =
    from n1 in Enumerable.Range(0, words.Length)
    from n2 in Enumerable.Range(1, words.Length - n1)
    select String.Join(" ", words.Skip(n1).Take(n2));

var frequencies =
    groups
        .GroupBy(x => x)
        .Select(x => new { wordgroup = x.Key, count = x.Count() })
        .OrderByDescending(x => x.count)
        .ThenBy(x => x.wordgroup.Count(y => y == ' '))
        .ThenBy(x => x.wordgroup)
        .ToArray();

This gives me the frequency of every single word grouping of contiguous sequences of words including up to a single word group of all the words.

The number of words is 288. The total number of word groups is 288 x (288 + 1) / 2 = 41,616. The final number of word groups (after grouping duplicate word groups and removing empty/whitespace strings) is 41,449.

Here are the first 100 of these 41,449:

20 x "the", 13 x "and", 12 x "algae", 12 x "in", 11 x "green", 10 x "of", 9 x "green algae", 8 x "are", 6 x "as", 6 x "species", 5 x "a", 4 x "is", 4 x "or", 4 x "to", 3 x "embryophytes", 3 x "form", 3 x "found", 3 x "lichens", 3 x "live", 3 x "on", 3 x "plants", 3 x "that", 3 x "algae and", 3 x "and in", 3 x "as the", 3 x "in the", 3 x "of the", 2 x "alga", 2 x "can", 2 x "clade", 2 x "class", 2 x "colonial", 2 x "filamentous", 2 x "from", 2 x "higher", 2 x "macroscopic", 2 x "most", 2 x "other", 2 x "seaweeds", 2 x "their", 2 x "trentepohlia", 2 x "while", 2 x "with", 2 x "algae are", 2 x "are a", 2 x "green alga", 2 x "higher plants", 2 x "in lichens", 2 x "of green", 2 x "species of", 2 x "the clade", 2 x "the green", 2 x "green algae and", 2 x "green algae are", 2 x "of green algae", 2 x "species of green", 2 x "the green algae", 2 x "species of green algae", 1 x "about", 1 x "acquired", 1 x "algal", 1 x "also", 1 x "associations", 1 x "bark", 1 x "be", 1 x "both", 1 x "cannot", 1 x "cell", 1 x "cells", 1 x "cellular", 1 x "charales", 1 x "charophyte", 1 x "charophytes", 1 x "chlorarachniophytes", 1 x "chlorophyte", 1 x "chloroplasts", 1 x "ciliate", 1 x "closest", 1 x "coccoid", 1 x "coenobia", 1 x "colonies", 1 x "conduct", 1 x "consisting", 1 x "differentiated", 1 x "differentiation", 1 x "divisions", 1 x "emerged", 1 x "euglenids", 1 x "excluded", 1 x "family", 1 x "few", 1 x "filaments", 1 x "flagella", 1 x "flagellates", 1 x "flatworms", 1 x "for", 1 x "forms", 1 x "full", 1 x "fungal", 1 x "fungi"

like image 26
Enigmativity Avatar answered Nov 14 '22 22:11

Enigmativity


Here is a streaming approach that recursively builds up groups of size N (3 in this example) from an enumerable of words. It doesn't matter how you tokenize your input into words (I've used a simple regex in this example).

//tokenize input (enumerable of string)
var words = Regex.Matches(input, @"\w+").Cast<Match>().Select(m => m.Value);

//get word groups (enumerable of string[])
var groups = GetWordGroups(words, 3);

//do what you want with your groups; suppose you want to count them
var counts = new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);
foreach (var group in groups.Select(g => string.Join(" ", g)))
{
    int count;
    counts.TryGetValue(group, out count);
    counts[group] = ++count;
}


IEnumerable<string[]> GetWordGroups(IEnumerable<string> words, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException();
    if (size == 1)
    {
        foreach (var word in words)
        {
            yield return new string[] { word };
        }

        yield break;
    }

    var prev = new string[0];

    foreach (var next in GetWordGroups(words, size - 1))
    {
        yield return next;

        //stream of groups includes all groups up to size - 1, but we only combine groups of size - 1
        if (next.Length == size - 1)
        {
            if (prev.Length == size - 1)
            {
                var group = new string[size];
                Array.Copy(prev, 0, group, 0, prev.Length);
                group[group.Length - 1] = next[next.Length - 1];
                yield return group;
            }

            prev = next;
        }
    }
}

One advantage with a streaming approach such as this is you minimize the number of strings that must be held in memory at any time (this reduces memory use for large bodies of text). Depending on how you receive your input, another optimization may be to use a TextReader to produce an enumeration of tokens as you read the input.

An example of the intermediate grouping output follows (each item is actually array of tokens, joined with a white space for output here):

The 
green 
The green 
algae 
green algae 
The green algae 
singular 
algae singular 
green algae singular 
green 
singular green 
algae singular green 
alga 
green alga 
singular green alga 
like image 37
Michael Petito Avatar answered Nov 14 '22 23:11

Michael Petito