
Excluding words from dictionary

I am reading through documents and splitting each one into words to build a dictionary of word counts. How can I exclude some words (such as "the", "a" and "an")?

This is my function:

private void Splitter(string[] file)
{
    try
    {
        tempDict = file
            .SelectMany(i => File.ReadAllLines(i)
            .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Distinct())
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }
    catch (Exception ex)
    {
        Ex(ex);
    }
}

Also, in this scenario, where is the right place to add a .ToLower() call so that all the words read from the files are lowercased? I was thinking about something like this, placed before the tempDict = file... statement:

file.ToList().ConvertAll(d => d.ToLower());
Ken'ichi Matsuyama asked May 15 '15 07:05



2 Answers

Do you want to filter out stop words?

 HashSet<String> StopWords = new HashSet<String> { 
   "a", "an", "the" 
 }; 

 ...

 tempDict = file
   .SelectMany(i => File.ReadAllLines(i)
     .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
     .AsParallel()
     .Select(word => word.ToLower()) // <- To lower case
     .Where(word => !StopWords.Contains(word)) // <- No stop words
     .Distinct()) // <- Closes the outer SelectMany, as in the question: each word once per file
   .GroupBy(word => word)
   .ToDictionary(g => g.Key, g => g.Count());

However, this code is only a partial solution: proper names such as Berlin will be converted to lower case (berlin), acronyms such as KISS (Keep It Simple, Stupid) will become just kiss, and some numbers will be incorrect.
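One possible way around the casing problem is to leave the words untouched and make only the stop-word lookup case-insensitive. The sketch below is a minimal illustration under a few assumptions: the class name StopWordDemo and the method CountWords are hypothetical, the input is a sequence of in-memory lines rather than files, and the Distinct() and AsParallel() calls are left out, so every occurrence is counted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordDemo
{
    // Case-insensitive set: "The", "THE" and "the" all match without calling ToLower().
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "the" };

    private static readonly char[] Separators = { ' ', ',', '.', '?', '!' };

    // Counts every occurrence of each word, keeping the original casing of the word.
    public static Dictionary<string, int> CountWords(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            .Where(word => !StopWords.Contains(word))
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var counts = CountWords(new[] { "The KISS principle.", "A kiss in Berlin!" });
        foreach (var pair in counts)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

With this approach "The" and "A" are filtered out, while KISS, kiss and Berlin keep their casing. The trade-off is that differently cased spellings of an ordinary word (Cat and cat, say) end up as separate keys, so whether this is preferable to lowercasing depends on the documents.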

Dmitry Bychenko answered Sep 29 '22 06:09


I would do this:

var ignore = new [] { "the", "a", "an" };
tempDict = file
    .SelectMany(i =>
        File
            .ReadAllLines(i)
            .SelectMany(line =>
                line
                    .ToLowerInvariant()
                    .Split(
                        new[] { ' ', ',', '.', '?', '!', },
                        StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Distinct())
    .Where(x => !ignore.Contains(x))
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());

You could change ignore to a HashSet<string> if performance becomes an issue, but that is unlikely, since the file I/O will probably dominate the running time.
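For reference, here is a rough, self-contained sketch of the HashSet<string> variant. It is not the exact code above: the names DocumentFrequencyDemo and CountDocumentsPerWord are hypothetical, each string[] stands in for the result of File.ReadAllLines on one file, and AsParallel() is dropped for simplicity. It also makes visible what the per-file Distinct() does: the resulting count is the number of files containing a word, not the total number of occurrences.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DocumentFrequencyDemo
{
    // HashSet gives constant-time lookups instead of scanning an array.
    private static readonly HashSet<string> Ignore =
        new HashSet<string> { "the", "a", "an" };

    private static readonly char[] Separators = { ' ', ',', '.', '?', '!' };

    // Each element of documents holds the lines of one file
    // (an in-memory stand-in for File.ReadAllLines).
    public static Dictionary<string, int> CountDocumentsPerWord(IEnumerable<string[]> documents)
    {
        return documents
            .SelectMany(lines => lines
                .SelectMany(line => line
                    .ToLowerInvariant()
                    .Split(Separators, StringSplitOptions.RemoveEmptyEntries))
                .Distinct())                        // keep each word once per document
            .Where(word => !Ignore.Contains(word))  // drop ignored words
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var docs = new[]
        {
            new[] { "The cat saw the cat." },
            new[] { "A cat and a dog!" }
        };
        var counts = CountDocumentsPerWord(docs);
        Console.WriteLine(counts["cat"]); // 2: "cat" appears in both documents
        Console.WriteLine(counts["dog"]); // 1: "dog" appears in one document
    }
}
```

If total occurrences are wanted instead, removing the Distinct() call should be enough.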

Enigmativity answered Sep 29 '22 06:09