
"Stop words" list for English? [closed]

I'm generating some statistics for some English-language text and I would like to skip uninteresting words such as "a" and "the".

  • Where can I find some lists of these uninteresting words?
  • Is a list of these words the same as a list of the most frequently used words in English?

Update: these are apparently called "stop words", not "skip words".

Mark Harrison asked Aug 02 '09 07:08



3 Answers

The magic word to put into Google is "stop words". This turns up a reasonable-looking list.

MySQL also has a built-in list of stop words, but it is far too comprehensive for my taste. For example, at our university library we ran into problems because "third" in "third world" was treated as a stop word.

Thomas answered Sep 22 '22 06:09



These are called stop words; check this sample.

Ahmed answered Sep 25 '22 06:09



Depending on the subdomain of English you are working in, you may need or want to compile your own stop word list. Some generic stop words can be meaningful in a particular domain. For example, the word "are" could be an abbreviation or acronym in some domain. Conversely, depending on your application, you may want to ignore some domain-specific words that you would not ignore in general English. For example, if you are analyzing a corpus of hospital reports, you may wish to ignore words like 'history' and 'symptoms', since they appear in every report and carry little information (from a plain vanilla inverted index perspective).

Otherwise, the lists returned by Google should be fine. The Porter Stemmer uses this and the Lucene search engine implementation uses this.
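
To make the customization described above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the tiny `BASE_STOP_WORDS` set stands in for whichever published list you start from, and the helper names (`build_stop_words`, `word_counts`, `words_in_every_document`) and the sample reports are hypothetical. It removes a word that is meaningful in the domain ("are"), adds domain-specific noise words ("history", "symptoms"), and shows one crude way to find candidates for such additions: words that occur in every document.

```python
import re
from collections import Counter

# Tiny illustrative base list (an assumption); in practice, load a published list.
BASE_STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "was"}


def build_stop_words(base, keep=(), extra=()):
    """Return the base list minus words to keep, plus extra domain-specific words."""
    return (set(base) - set(keep)) | set(extra)


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text, stop_words):
    """Count the tokens that are not in the stop word set."""
    return Counter(t for t in tokenize(text) if t not in stop_words)


def words_in_every_document(documents):
    """Return words present in every document: candidates for a domain stop list."""
    token_sets = [set(tokenize(d)) for d in documents]
    return set.intersection(*token_sets) if token_sets else set()


# Hypothetical sample corpus of two short hospital reports.
reports = [
    "History of asthma. Symptoms: cough and wheezing.",
    "History of diabetes. Symptoms: fatigue and thirst.",
]

stop_words = build_stop_words(
    BASE_STOP_WORDS,
    keep={"are"},                   # meaningful in this (hypothetical) domain
    extra={"history", "symptoms"},  # found in every report, so uninformative
)
counts = word_counts(" ".join(reports), stop_words)
candidates = words_in_every_document(reports)
```

Here `counts` contains only the content words (asthma, cough, wheezing, and so on), while `candidates` contains the words shared by every report. The candidates still need a human pass: some are generic stop words already, and others may matter for your application.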

hashable answered Sep 25 '22 06:09
