Excluding words from dictionary

Tags: c#, dictionary, tolower, wpf

I am reading through documents and splitting them into words so that each word ends up in a dictionary. How can I exclude some words (such as "the", "a" and "an")?

This is my function:


private void Splitter(string[] file)
{
    try
    {
        tempDict = file
            .SelectMany(i => File.ReadAllLines(i)
            .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Distinct())
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }
    catch (Exception ex)
    {
        Ex(ex);
    }
}

Also, in this scenario, where is the right place to add a .ToLower() call so that all the words read from the files are in lowercase? I was thinking of something like this before the (temp = file..) statement:


file.ToList().ConvertAll(d => d.ToLower());


asked May 15 '15 07:05

Ken'ichi Matsuyama

2 Answers

Do you want to filter out stop words?


 HashSet<String> StopWords = new HashSet<String> { 
   "a", "an", "the" 
 }; 

 ...

 tempDict = file
   .SelectMany(i => File.ReadAllLines(i)
   .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
   .AsParallel()
   .Select(word => word.ToLower()) // <- To Lower case 
   .Where(word => !StopWords.Contains(word)) // <- No stop words
   .Distinct()) // <- closes the outer SelectMany, as in the question
   .GroupBy(word => word)
   .ToDictionary(g => g.Key, g => g.Count());

However, this code is only a partial solution. Proper names such as Berlin are converted to lower case (berlin), acronyms such as KISS (Keep It Simple, Stupid) become just kiss, and some numbers will be incorrect.
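One way to soften the casing caveat, offered as a sketch rather than a full fix: skip ToLower() and give the stop-word set a case-insensitive comparer, so "The" is still filtered while Berlin and KISS keep their spelling. The names StopWordFilter, Words and Separators are hypothetical and not part of the original code. The trade-off is that words differing only by case (KISS and kiss) stay separate keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordFilter
{
    // Case-insensitive lookup: "The", "THE" and "the" all match.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "the" };

    private static readonly char[] Separators = { ' ', ',', '.', '?', '!' };

    // Hypothetical helper: splits one line, drops stop words,
    // and leaves the casing of the remaining words untouched.
    public static List<string> Words(string line)
    {
        return line
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !StopWords.Contains(word))
            .ToList();
    }
}
```

For example, StopWordFilter.Words("The KISS principle, a kiss in Berlin.") yields KISS, principle, kiss, in, Berlin.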


answered Sep 29 '22 06:09

Dmitry Bychenko

I would do this:


var ignore = new [] { "the", "a", "an" };
tempDict = file
    .SelectMany(i =>
        File
            .ReadAllLines(i)
            .SelectMany(line =>
                line
                    .ToLowerInvariant()
                    .Split(
                        new[] { ' ', ',', '.', '?', '!', },
                        StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Distinct())
    .Where(x => !ignore.Contains(x))
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());

You could change ignore to a HashSet<string> if performance becomes an issue, but that is unlikely to matter, since the file I/O will dominate the running time.
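A sketch of that HashSet<string> variant, assuming in-memory lines stand in for File.ReadAllLines so it runs without any files. WordCounter, CountDocuments and Separators are hypothetical names, and AsParallel() is left out for brevity. Because Distinct() runs per document, the resulting count is the number of documents that contain each word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Hypothetical ignore list; a HashSet gives constant-time lookups.
    private static readonly HashSet<string> Ignore =
        new HashSet<string> { "the", "a", "an" };

    private static readonly char[] Separators = { ' ', ',', '.', '?', '!' };

    // Each inner sequence stands in for the lines of one file
    // (what File.ReadAllLines(path) would return).
    public static Dictionary<string, int> CountDocuments(
        IEnumerable<IEnumerable<string>> documents)
    {
        return documents
            .SelectMany(lines => lines
                .SelectMany(line => line
                    .ToLowerInvariant()
                    .Split(Separators, StringSplitOptions.RemoveEmptyEntries))
                .Where(word => !Ignore.Contains(word))
                .Distinct())               // one entry per word per document
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

With two documents, { "The cat sat.", "A cat!" } and { "An owl and the cat?" }, the result maps cat to 2 and sat, owl and and to 1 each, with no entry for the, a or an.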

answered Sep 29 '22 06:09

Enigmativity
