
How to add custom stop word list to StopWordsRemover

I am using the pyspark.ml.feature.StopWordsRemover class on my PySpark DataFrame, which has an ID column and a Text column. In addition to the default stop word list, I would like to add my own custom list so that numeric values are also removed from the text.

I can see that the class provides a setStopWords method, but I am struggling with the proper syntax for using it.

from pyspark.ml.feature import StopWordsRemover

a = StopWordsRemover(inputCol="words", outputCol="filtered")
b = a.transform(df)

The above code gives me the expected results in the filtered column, but it only removes the standard stop words. I am looking for a way to add my own custom list containing additional words and numeric values that I wish to filter out.

user2763088 asked Apr 26 '17 01:04





1 Answer

You can specify it with the stopWords parameter:

stopwordList = ["word1", "word2", "word3"]

StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)

A small note:

The above solution replaces the original list of stop words with the list supplied.
If you want to add your own stop words in addition to the predefined ones, extend your list with the default list before passing it to StopWordsRemover() as a parameter. Converting the result to a set removes any duplicates.

stopwordList = ["word1", "word2", "word3"]
stopwordList.extend(StopWordsRemover().getStopWords())
stopwordList = list(set(stopwordList))  # optional: removes duplicates
StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
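
As a hedged sketch of the append approach, the following keeps the default list in its original order and appends only the custom words that are not already in it. A stop word list matches whole tokens, so it cannot enumerate every possible number; the sketch therefore drops numeric tokens in a separate step. The helper names (merge_stop_words, is_numeric_token, remove_stop_words_and_numbers), the regular expression, and the use of a UDF are assumptions made for illustration and are not part of the original answer. The column names words and filtered come from the question.

```python
import re

# Assumption: "numeric" means integers and simple decimals such as "42", "3.14",
# "-7" or "1,000". Adjust the pattern to match your own data.
_NUMERIC = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def merge_stop_words(default_words, custom_words):
    """Return default_words followed by any custom words not already present."""
    seen = set()
    merged = []
    for word in list(default_words) + list(custom_words):
        if word not in seen:
            seen.add(word)
            merged.append(word)
    return merged


def is_numeric_token(token):
    """Return True if the whole token looks like a number."""
    return _NUMERIC.fullmatch(token) is not None


def remove_stop_words_and_numbers(df, custom_words):
    """Remove default plus custom stop words, then drop numeric tokens.

    Needs pyspark and an active SparkSession. The imports live inside the
    function so the two helpers above can be used without Spark.
    """
    from pyspark.ml.feature import StopWordsRemover
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    defaults = StopWordsRemover.loadDefaultStopWords("english")
    remover = StopWordsRemover(
        inputCol="words",
        outputCol="filtered",
        stopWords=merge_stop_words(defaults, custom_words),
    )
    # A plain Python UDF; null arrays are passed through unchanged.
    drop_numbers = F.udf(
        lambda tokens: None
        if tokens is None
        else [t for t in tokens if not is_numeric_token(t)],
        ArrayType(StringType()),
    )
    return remover.transform(df).withColumn("filtered", drop_numbers("filtered"))
```

With the DataFrame from the question, this could be called as remove_stop_words_and_numbers(df, ["word1", "word2", "word3"]). StopWordsRemover matches case-insensitively unless caseSensitive=True is set, so custom words can be supplied in lowercase.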

ML_TN answered Oct 21 '22 01:10
