
Remove Stopwords in French AND English in TfidfVectorizer

I am trying to remove both French and English stopwords in TfidfVectorizer. So far I have only managed to remove the English ones: when I try to pass the French language to stop_words, it is rejected as not built in.

In fact, I get the following error message:

ValueError: not a built-in stop list: french

I have a text document containing 700 lines of text mixed in French and English.

I am clustering these 700 lines using Python. The problem is that my clusters come out full of French stopwords, which ruins their quality.

My question is the following:

Is there any way to add French stopwords or manually update the built-in English stopword list so that I can get rid of these unnecessary words?

Here's the TfidfVectorizer code that contains my stopwords code:

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

Removing these French stopwords would give me clusters that are actually representative of the words recurring in my document.

In case the relevance of this question is in doubt: I asked a related question last week, but it is not a duplicate, since it did not involve TfidfVectorizer.

Any help would be greatly appreciated. Thank you.

asked Dec 17 '22 by OnThaRise

2 Answers

You can use the stopword lists from NLTK or spaCy, two very popular NLP libraries for Python. Since aschultz has already added a snippet for the stop-words library, I will show how to go about it with NLTK or spaCy.

NLTK:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine the English and French stopword lists into a single list
final_stopwords_list = stopwords.words('english') + stopwords.words('french')

tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
                                   max_features=200000,
                                   min_df=0.2,
                                   stop_words=final_stopwords_list,
                                   use_idf=True,
                                   tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

NLTK will give you 334 stopwords in total.
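If you want to sanity-check what the vectorizer will receive, the count is easy to print. Note that the NLTK stopword corpus has to be downloaded once, and the exact total may vary slightly between NLTK data releases:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword corpora

final_stopwords_list = stopwords.words('english') + stopwords.words('french')
print(len(final_stopwords_list))   # around 334 with the NLTK data current at the time
print(final_stopwords_list[:5])    # peek at the first few entries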

spaCy:

from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop
from sklearn.feature_extraction.text import TfidfVectorizer

# spaCy exposes its stopwords as sets, so convert and concatenate them
final_stopwords_list = list(fr_stop) + list(en_stop)

tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
                                   max_features=200000,
                                   min_df=0.2,
                                   stop_words=final_stopwords_list,
                                   use_idf=True,
                                   tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

spaCy gives you 890 stopwords in total.
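If you want the broadest possible coverage, nothing stops you from merging the NLTK and spaCy lists. This merge is my own suggestion rather than part of the original answer; a set union drops the words the two sources have in common, and the final count depends on the library versions installed:

from nltk.corpus import stopwords
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

# Union of both sources; the set drops words that appear in more than one list
merged_stopwords = list(
    set(stopwords.words('english')) | set(stopwords.words('french'))
    | set(fr_stop) | set(en_stop)
)
print(len(merged_stopwords))  # varies with the NLTK and spaCy versions installed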

answered Dec 28 '22 by Ankur Sinha


Igor Sharm noted ways to do things manually, but perhaps you could also install the stop-words package. Then, since TfidfVectorizer accepts a list for the stop_words parameter:

from stop_words import get_stop_words
from sklearn.feature_extraction.text import TfidfVectorizer

# Concatenate the English and French lists from the stop-words package
my_stop_word_list = get_stop_words('english') + get_stop_words('french')

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=my_stop_word_list,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

You could also read and parse the french.txt file in the project's GitHub repository if you want to include only some of the words, as sketched below.
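For example, assuming you have saved the project's french.txt locally and that it contains one stopword per line (check the file in the repository to confirm its layout), a filtered list could be built like this; the file path and the keep-list are placeholders to adjust to your own setup:

# Minimal sketch: load a plain-text stopword file and drop the words you want to keep.
# 'french.txt' and words_to_keep are hypothetical; adapt them to your own data.
with open('french.txt', encoding='utf-8') as f:
    french_stopwords = [line.strip() for line in f if line.strip()]

words_to_keep = {'ne', 'pas'}  # hypothetical words you still want the vectorizer to see
custom_stopwords = [w for w in french_stopwords if w not in words_to_keep]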

answered Dec 28 '22 by aschultz