I'm wondering where I can find the full list of supported langs (and their keys) for the NLTK stopwords.
I found a list at https://pypi.org/project/stop-words/ but it does not contain the key for each language. So it is not clear whether you can retrieve a list simply by calling stopwords.words("Bulgarian"). In fact, that throws an error.
I checked the NLTK site and there are 4 documents matching "stopwords", but none of them describes that. https://www.nltk.org/search.html?q=stopwords&check_keywords=yes&area=default
And nothing is said in their book: http://www.nltk.org/book/ch02.html#stopwords_index_term
So, do you know where I can find the list of keys?
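NLTK (Natural Language Toolkit) ships predefined stop word lists. The English list includes words such as "a", "an", "the", "of", and "in", and its exact length depends on the NLTK version. Stop words are among the most common words in a text, and you usually do not want them when describing the topic of your content. The lists are predefined in the corpus, but stopwords.words() returns an ordinary Python list, so you can filter or extend it in your own code.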
What are stop words? The words that are generally filtered out before processing natural language text are called stop words. These are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.
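To make the idea of filtering concrete, here is a minimal sketch that needs no NLTK install. The names SAMPLE_STOP_WORDS and remove_stop_words are made up for this illustration, and the tiny word set is hand-picked, not NLTK's actual list.

```python
# Illustrative only: a tiny hand-picked stop word set, not NLTK's list.
SAMPLE_STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is"}


def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The list of keys is in the corpus".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['list', 'keys', 'corpus']
```

With NLTK and its stopwords corpus installed, you could presumably pass set(stopwords.words('english')) as stop_words instead of the sample set.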
First check whether you have downloaded the NLTK data packages. If not, you can download them as follows:
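import nltk
nltk.download('stopwords')  # fetches only the stopwords corpus; nltk.download() with no argument opens the interactive downloader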
After this, you can find the stopword language files in the path below (on Windows).
C:/Users/username/AppData/Roaming/nltk_data/corpora/stopwords
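If you would rather inspect that folder from Python, something like the sketch below may help. It uses only the standard library. The helper name list_stopword_languages is hypothetical, and the default directory shown is an assumption about a per-user install, so adjust it to wherever nltk_data lives on your machine.

```python
from pathlib import Path


def list_stopword_languages(stopwords_dir):
    """Return sorted file names in the directory, skipping README and hidden files."""
    directory = Path(stopwords_dir)
    if not directory.is_dir():
        return []  # corpus not downloaded here, or the path is wrong
    return sorted(
        p.name
        for p in directory.iterdir()
        if p.is_file() and p.name != "README" and not p.name.startswith(".")
    )


# Assumed location for a per-user install; change this for your machine.
default_dir = Path.home() / "nltk_data" / "corpora" / "stopwords"
print(list_stopword_languages(default_dir))
```

Each file name it prints should be usable as the language key, assuming the folder really is the stopwords corpus.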
There are 21 languages supported (I installed nltk a few days ago, so this number should be up to date). You can pass a file name as the parameter in
nltk.corpus.stopwords.words('language')
When you import the stopwords using:
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
you are retrieving the stopwords based upon the fileid (language). In order to see all available stopword languages, you can retrieve the list of fileids using:
from nltk.corpus import stopwords
print(stopwords.fileids())
In the case of NLTK v3.4.5, this returns 23 languages:
['arabic',
'azerbaijani',
'danish',
'dutch',
'english',
'finnish',
'french',
'german',
'greek',
'hungarian',
'indonesian',
'italian',
'kazakh',
'nepali',
'norwegian',
'portuguese',
'romanian',
'russian',
'slovene',
'spanish',
'swedish',
'tajik',
'turkish']
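Since the keys in this listing are all lowercase, one way to avoid the error from the question is to normalize the name and check it against the fileids before calling words(). This is only a sketch: resolve_language and load_stop_words are hypothetical helper names, and load_stop_words assumes nltk and the stopwords corpus are installed.

```python
def resolve_language(name, available):
    """Return the matching key from `available`, or None if unsupported."""
    key = name.strip().lower()
    return key if key in available else None


def load_stop_words(name):
    """Look up stop words via NLTK, raising ValueError for unknown languages."""
    from nltk.corpus import stopwords  # requires nltk and the stopwords corpus

    key = resolve_language(name, stopwords.fileids())
    if key is None:
        raise ValueError(f"No stop word list for {name!r}")
    return stopwords.words(key)


# A few keys from the listing above, so the check runs without NLTK.
sample_keys = ["arabic", "english", "french", "german"]
print(resolve_language("English", sample_keys))    # english
print(resolve_language("Bulgarian", sample_keys))  # None
```

Checking first gives you a clear message for an unsupported language such as "Bulgarian" instead of a corpus lookup error.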