
How Can I Add More Languages to Stopwords in NLTK?

I'm using NLTK with stopwords to detect the language of a document using the method described by Alejandro Nolla at http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/ , and it works reasonably well.

I'm also working with some additional languages not included in the NLTK stopwords package, such as Czech and Romanian, and they get false matches as other languages. These are the languages in stopwords:

['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']

How can I expand the list of languages supported by NLTK? Are there other stopword lists available that I can add? Is there a documented method I can use to create and add my own stopword lists?

Jason Champion asked Jan 26 '14 18:01


1 Answer

Googling for "Romanian stopwords" brings up a good number of resources.
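Once you have such a list, the scoring idea from the linked article only needs a set of words per language, so the list does not strictly have to live inside the NLTK package. Below is a minimal sketch under that assumption; `load_stopwords`, `detect_language`, and `STOPWORD_SETS` are hypothetical names, and the word lists are tiny illustrative samples rather than real stopword lists.

```python
import re


def load_stopwords(text):
    """Parse a one-word-per-line list, skipping blank lines and '#' comments."""
    words = set()
    for line in text.splitlines():
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def detect_language(text, stopword_sets):
    """Score each language by how many of its stopwords occur in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(tokens & words) for lang, words in stopword_sets.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Illustrative samples only; a real list would be much longer.
STOPWORD_SETS = {
    "english": load_stopwords("the\nand\nof\nis\nin\n"),
    "romanian": load_stopwords("# sample entries\nși\nîn\neste\nde\nla\ncare\n"),
}

language, scores = detect_language("Cartea este pe masă și este în cameră", STOPWORD_SETS)
print(language, scores)
```

In practice you would read each list from a file (for example `load_stopwords(open(path, encoding="utf-8").read())`) and merge the resulting sets with the ones NLTK already provides.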

If you want to do this yourself, you simply need to find words which are common in all genres of text. (The article you link to has a rather poor explanation of what stop words are.) Good candidates are articles, particles (if your language has them, and they occur in isolation), conjunctions, pronouns, and some types of adverbs.
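A rough way to try the "common in all genres" idea: take the most frequent words in a sample of each genre and keep only the words that show up in every sample. This is just a sketch of that heuristic, not the method from any particular paper; `top_words`, `stopword_candidates`, the cutoff `n`, and the sample texts are all made up for illustration, and the output still needs a manual review.

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most frequent lowercase word tokens in text."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return {word for word, _ in counts.most_common(n)}


def stopword_candidates(texts_by_genre, n=50):
    """Keep only words that rank among the top n in every genre sample."""
    top_sets = [top_words(text, n) for text in texts_by_genre.values()]
    return set.intersection(*top_sets) if top_sets else set()


# Toy samples; real input would be a sizeable corpus per genre.
samples = {
    "news": "the market fell and the bank cut rates",
    "fiction": "the dragon slept and the knight waited",
    "manual": "press the button and hold the lever",
}

print(sorted(stopword_candidates(samples)))  # ['and', 'the']
```

With real corpora you would tune `n` and then prune the result by hand, since frequent content words can slip through when the genre samples share a topic.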

Automatically Building a Stopword List for an Information Retrieval System (Rachel Tsz-Wai Lo, Ben He, Iadh Ounis; University of Glasgow, 2008) (PDF) documents an automatic method for finding stop words. I have not looked at the method or its results.

https://github.com/berkmancenter/mediacloud/blob/master/script/mediawords_generate_stopwords.pl seems to have an implementation. (The comment has other names than the article; not sure what's up with that.)

tripleee answered Oct 11 '22 13:10
