
Known list of "filler" words; how to rip good keywords using C#?

Tags:

c#

If I've got a block of text, in English, what's the best method of clearing away all the "filler" words like "the, it, or, we, us", etc... leaving only viable words to be considered the real, core, content of the text?

I'm brainstorming a way to automatically tie blocks of text together based on how similar they are in keyword composition.

I can't be the first one to imagine this. Is there a popular, effective way this can be accomplished using C#?

Update

I am essentially trying to link one block of text to n "related" blocks of text, where the primary "content" is so similar that each related block could be considered additional information for the text it is linked to.

Chaddeus asked Dec 26 '22 22:12



1 Answer

These are called stop words: words that are usually (1) not essential for understanding the data, and that indexers therefore remove.

Almost every Information Retrieval system I am aware of implements a tokenizer that filters out these words.
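If you just want to see the idea without pulling in a library, here is a minimal sketch in plain C#. The names `KeywordExtractor` and `ExtractKeywords` are made up for illustration, and the stop-word list is a small hand-picked sample, not a standard list; a real system would use a longer one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordExtractor
{
    // Illustrative sample only (an assumption, not a standard list).
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[]
        {
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
            "such", "that", "the", "their", "then", "there", "these", "they",
            "this", "to", "was", "we", "us", "will", "with"
        },
        StringComparer.OrdinalIgnoreCase);

    // Lowercases the text, splits it into word tokens, and drops
    // every token that appears in StopWords. Order is preserved.
    public static List<string> ExtractKeywords(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"[a-z0-9']+")
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(word => !StopWords.Contains(word))
            .ToList();
    }
}
```

For example, `KeywordExtractor.ExtractKeywords("The cat and the dog are on the mat")` keeps only `cat`, `dog` and `mat`. The regex tokenizer is deliberately crude (ASCII letters, digits and apostrophes), which is one reason to prefer a proper analyzer for real text.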

I am familiar with Java's Lucene, which has a StandardAnalyzer that does this for you. I assume this analyzer also exists in Lucene.NET, so you may want to track it down and use it.
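As a rough sketch of what that could look like, assuming Lucene.NET 4.8 with the `Lucene.Net` and `Lucene.Net.Analysis.Common` NuGet packages installed: the class name `LuceneKeywords` and the field name `"content"` are placeholders I picked for illustration, and the exact stop-word set depends on the analyzer's defaults for your version, so check the output on your own text.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class LuceneKeywords
{
    // Runs the text through StandardAnalyzer and collects the tokens
    // it emits (lowercased, with its default stop words removed).
    public static List<string> Extract(string text)
    {
        var keywords = new List<string>();
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        using (TokenStream stream = analyzer.GetTokenStream("content", new StringReader(text)))
        {
            // The attribute instance is reused; read it after each IncrementToken().
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
            {
                keywords.Add(term.ToString());
            }
            stream.End();
        }
        return keywords;
    }
}
```

Swapping `StandardAnalyzer` for another analyzer should only require changing the constructor line, since the token loop is the same for any `Analyzer`.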

You might also be interested in stemming (reducing words to a common root form, so that "running" and "runs" both become "run"), which Lucene also handles, for instance with EnglishAnalyzer.
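Neither stop-word removal nor stemming links the texts by itself; you still need some similarity measure over the keywords that survive. One simple option, which is my suggestion and not something Lucene does for you here, is the Jaccard index: shared distinct keywords divided by total distinct keywords. The sketch below assumes each block of text has already been reduced to a keyword list; `KeywordSimilarity`, `Jaccard` and `MostRelated` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordSimilarity
{
    // Jaccard index over distinct keywords: |A ∩ B| / |A ∪ B|.
    // Returns a value from 0.0 (nothing shared) to 1.0 (identical sets).
    public static double Jaccard(IEnumerable<string> a, IEnumerable<string> b)
    {
        var setA = new HashSet<string>(a, StringComparer.OrdinalIgnoreCase);
        var setB = new HashSet<string>(b, StringComparer.OrdinalIgnoreCase);
        if (setA.Count == 0 && setB.Count == 0) return 0.0;

        int shared = setA.Count(word => setB.Contains(word));
        int total = setA.Count + setB.Count - shared;
        return (double)shared / total;
    }

    // Returns the indexes of the n candidates most similar to source,
    // best match first.
    public static List<int> MostRelated(
        IReadOnlyList<string> source,
        IReadOnlyList<IReadOnlyList<string>> candidates,
        int n)
    {
        return Enumerable.Range(0, candidates.Count)
            .OrderByDescending(i => Jaccard(source, candidates[i]))
            .Take(n)
            .ToList();
    }
}
```

This ignores how often a keyword occurs and how rare it is across the collection, so treat it as a starting point; weighting schemes such as TF-IDF are the usual next step if plain overlap turns out to be too coarse.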


(1) Why usually? In sarcasm detectors, for example, it seems (empirically) that stop words are critical to getting good results.

amit answered Feb 19 '23 22:02
