
NLTK available languages for stopwords

I'm wondering where I can find the full list of supported languages (and their keys) for the NLTK stopwords.

I found a list at https://pypi.org/project/stop-words/ but it does not give the key for each language, so it is not clear whether you can retrieve a list by simply calling stopwords.words("Bulgarian"). In fact, that throws an error.

I checked the NLTK site, and there are 4 documents matching "stopwords", but none of them describes this. https://www.nltk.org/search.html?q=stopwords&check_keywords=yes&area=default

And nothing is said in their book: http://www.nltk.org/book/ch02.html#stopwords_index_term

So, do you know where I can find the list of keys?

gal007 asked Feb 07 '19 12:02



2 Answers

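First, check whether you have downloaded the NLTK data.
If not, you can download the stopwords corpus with:

import nltk
# Fetch only the stopwords corpus; nltk.download() with no arguments opens the interactive downloader instead.
nltk.download('stopwords')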

After this, you can find the stopword language files at the path below (on Windows).

C:/Users/username/AppData/Roaming/nltk_data/corpora/stopwords

It supports 21 languages (I installed NLTK a few days ago, so this number should be up to date). You can pass a file name as the parameter to:

nltk.corpus.stopwords.words('language')
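If you prefer to inspect the directory from Python rather than a file browser, something like the following rough sketch might help. The helper name list_language_files is hypothetical (it is not part of NLTK), and the path in the usage section is only an assumed location for nltk_data. It also assumes the corpus folder contains a README next to the language files, which is skipped.

```python
import os


def list_language_files(stopwords_dir):
    """Return the language file names found in an NLTK stopwords directory."""
    names = []
    for name in os.listdir(stopwords_dir):
        full_path = os.path.join(stopwords_dir, name)
        # Skip subdirectories, hidden files, and the README assumed to ship with the corpus.
        if not os.path.isfile(full_path):
            continue
        if name.startswith(".") or name.startswith("README"):
            continue
        names.append(name)
    return sorted(names)


if __name__ == "__main__":
    # Assumed location: adjust to wherever your nltk_data directory lives.
    stopwords_dir = os.path.expanduser("~/nltk_data/corpora/stopwords")
    if os.path.isdir(stopwords_dir):
        print(list_language_files(stopwords_dir))
    else:
        print("No stopwords directory at", stopwords_dir)
```

Each name it returns should be usable as the argument to nltk.corpus.stopwords.words().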

Sociopath answered Nov 08 '22 08:11


When you import the stopwords using:

from nltk.corpus import stopwords
language = 'english'  # any available fileid
english_stopwords = stopwords.words(language)

you are retrieving the stopwords based upon the fileid (language). In order to see all available stopword languages, you can retrieve the list of fileids using:

from nltk.corpus import stopwords
print(stopwords.fileids())

In the case of NLTK v3.4.5, this returns 23 languages:

['arabic', 
 'azerbaijani', 
 'danish', 
 'dutch', 
 'english', 
 'finnish', 
 'french', 
 'german', 
 'greek',
 'hungarian', 
 'indonesian', 
 'italian', 
 'kazakh', 
 'nepali', 
 'norwegian', 
 'portuguese', 
 'romanian', 
 'russian', 
 'slovene', 
 'spanish', 
 'swedish', 
 'tajik', 
 'turkish']
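
Since the fileids are lowercase, a small guard around stopwords.words() can turn the error from the question into a clearer message. This is only a sketch: resolve_language and load_stopwords are hypothetical helper names, not part of NLTK, and the case-insensitive matching is my own assumption about what is convenient.

```python
def resolve_language(name, available):
    """Return the fileid matching name (case-insensitive), or None if unsupported."""
    key = name.strip().lower()
    return key if key in available else None


def load_stopwords(name):
    """Load stopwords for name, raising ValueError for unsupported languages."""
    # Imported lazily so resolve_language can be used without NLTK installed.
    from nltk.corpus import stopwords

    available = stopwords.fileids()
    key = resolve_language(name, available)
    if key is None:
        raise ValueError(f"{name!r} is not available; choose one of {available}")
    return stopwords.words(key)
```

With the 23-language list above, load_stopwords("English") would resolve to the 'english' fileid, while load_stopwords("Bulgarian") would raise a ValueError listing the supported keys.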
thechill answered Nov 08 '22 06:11