I am reading through documents and splitting each line into words to build a dictionary of words. How can I exclude some words (such as "the", "a" and "an")?
This is my function:
private void Splitter(string[] file)
{
    try
    {
        tempDict = file
            .SelectMany(i => File.ReadAllLines(i)
                .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
                .AsParallel()
                .Distinct())
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }
    catch (Exception ex)
    {
        Ex(ex);
    }
}
Also, in this scenario, where is the right place to add a .ToLower() call so that all the words from the files are lowercase? I was thinking about something like this before the (tempDict = file ...) statement:
file.ToList().ConvertAll(d => d.ToLower());
Do you want to filter out stop words?
HashSet<String> StopWords = new HashSet<String> {
    "a", "an", "the"
};
...
tempDict = file
    .SelectMany(i => File.ReadAllLines(i)
        .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
        .AsParallel()
        .Select(word => word.ToLower())            // <- To lower case
        .Where(word => !StopWords.Contains(word))  // <- No stop words
        .Distinct())                               // <- closes the outer SelectMany
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());
However, this code is only a partial solution: proper names such as Berlin are converted to lower case (berlin), acronyms such as KISS (Keep It Simple, Stupid) become an ordinary kiss, and some numbers come out wrong (for example, 3.14 is split at the '.' into 3 and 14).
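One way to soften the casing problem is to leave the words untouched and make only the stop-word lookup case-insensitive. The following is a minimal sketch under that assumption, not the code from the question: StopWordDemo and CountWords are hypothetical names, the input is a list of in-memory lines rather than files, and every occurrence is counted (there is no per-file Distinct).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordDemo
{
    // Case-insensitive set: "The", "THE" and "the" all match.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "the" };

    private static readonly char[] Separators = { ' ', ',', '.', '?', '!' };

    // Hypothetical helper: counts words in the given lines, skipping stop words
    // without changing the case of the words that are kept.
    public static Dictionary<string, int> CountWords(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            .Where(word => !StopWords.Contains(word))
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var counts = CountWords(new[] { "The KISS principle.", "A kiss in Berlin, the city." });
        foreach (var pair in counts)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

With this approach KISS and kiss stay separate keys and Berlin keeps its capital letter, at the cost that "Word" and "word" are also counted separately.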
I would do this:
var ignore = new [] { "the", "a", "an" };
tempDict = file
    .SelectMany(i =>
        File
            .ReadAllLines(i)
            .SelectMany(line =>
                line
                    .ToLowerInvariant()
                    .Split(
                        new[] { ' ', ',', '.', '?', '!', },
                        StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Distinct())
    .Where(x => !ignore.Contains(x))
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());
You could change ignore to a HashSet<string> if performance becomes an issue, but that is unlikely to matter since you are doing file IO.
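As a rough illustration of that swap, the sketch below runs the same query shape with a HashSet<string>. It assumes in-memory "documents" (arrays of lines) in place of files so it runs without IO; IgnoreSetDemo and CountDocumentsPerWord are hypothetical names. The Contains call in the Where clause is written the same way as with the array.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IgnoreSetDemo
{
    // Hypothetical helper mirroring the query above, but reading from in-memory
    // "documents" (arrays of lines) instead of files.
    public static Dictionary<string, int> CountDocumentsPerWord(
        IEnumerable<string[]> documents, ISet<string> ignore)
    {
        return documents
            .SelectMany(lines => lines
                .SelectMany(line => line
                    .ToLowerInvariant()
                    .Split(new[] { ' ', ',', '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries))
                .Distinct())                       // each word once per document
            .Where(word => !ignore.Contains(word)) // set lookup instead of array scan
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var ignore = new HashSet<string> { "the", "a", "an" };
        var documents = new[]
        {
            new[] { "The cat sat.", "The cat slept." },
            new[] { "A cat!" }
        };
        var counts = CountDocumentsPerWord(documents, ignore);
        Console.WriteLine(counts["cat"]); // prints 2: "cat" appears in both documents
    }
}
```

Because Distinct runs per document before the grouping, each count is the number of documents that contain the word, not the total number of occurrences.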