I am using the pyspark.ml.feature.StopWordsRemover class on my PySpark DataFrame, which has an ID column and a Text column. In addition to the default stop word list, I would like to add my own custom list, including numeric values that I want removed from the strings.
I can see that the class provides a setStopWords method, but I am struggling with the proper syntax for using it.
from pyspark.ml.feature import StopWordsRemover

# inputCol must be an array-of-strings column (for example, tokenized text)
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
result = remover.transform(df)
The above code gives me the expected results in the filtered column, but it only removes the standard stop words. I'm looking for a way to add my own custom list containing additional words and numeric values that I wish to filter out.
By default, spaCy ships with 326 English stop words, but you may sometimes want to add your own custom stop words to that list. To add a custom stop word in spaCy, first load its English language model, then use the add() method on the stop word set.
To add two stop words at once, use the "|" symbol, because in Python | is the set union operator. Words not yet in the set are added; words already present are left unchanged, so no duplicates are created.
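The union behaviour described above can be illustrated with a plain Python set. This is only a sketch: the set below is a hypothetical stand-in for a library's stop word set, not spaCy's actual list, and the words in it are made up for the example.

```python
# Hypothetical stand-in for a library's default stop word set.
stop_words = {"the", "and", "of"}

# |= is in-place set union: "foo" is added, while "of" is already
# present and therefore not duplicated.
stop_words |= {"of", "foo"}

print(sorted(stop_words))
```

Because a set holds each element once, there is no need to check for membership before adding.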
You need to separate your word lists: one for single words and another for phrases. Then convert copy_phrase_list to a string and return it. Remove all your for loops and add the following for loop.
To build a domain-specific stop word list, sum the term frequency of each unique word w across all documents in your collection, then sort the terms in descending order of raw term frequency. You can take the top N terms as your stop words. You can also remove common English words (using a published stop list) before sorting, so that you target the domain-specific stop words. Another option is to treat words occurring in more than X% of your documents as stop words.
Alternatively, rank each term t_i in your collection by its IDF score in descending order and treat the bottom K terms, those with the lowest IDF scores, as your stop words. Again, you can remove common English words (using a published stop list) before sorting, so that you target the domain-specific low-IDF words.
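The frequency-based selection described above might be sketched in plain Python as follows. The function names, the toy documents, and the thresholds are all assumptions made for illustration. Since IDF falls as document frequency rises, the terms with the lowest IDF are the ones that appear in the largest share of documents, so the second function covers both the X% rule and the low-IDF idea.

```python
from collections import Counter


def top_terms_by_frequency(docs, top_n, exclude=frozenset()):
    """Return the top_n terms by summed raw term frequency, skipping `exclude`."""
    counts = Counter(
        token for doc in docs for token in doc if token not in exclude
    )
    return [term for term, _ in counts.most_common(top_n)]


def terms_above_doc_ratio(docs, min_ratio):
    """Return terms that occur in more than min_ratio of the documents."""
    # set(doc) counts each term at most once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return sorted(t for t, df in doc_freq.items() if df / len(docs) > min_ratio)


# Toy tokenized documents (hypothetical data).
docs = [
    ["the", "patient", "reported", "pain"],
    ["the", "patient", "was", "discharged"],
    ["the", "patient", "patient", "improved"],
]
general = {"the", "was"}  # stand-in for a published English stop list

print(top_terms_by_frequency(docs, top_n=1, exclude=general))
print(terms_above_doc_ratio(docs, min_ratio=0.9))
```

Excluding the general list first leaves only the domain-specific candidate ("patient" in this toy data), which is the point of removing common English words before sorting.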
You can specify it through the stopWords parameter:
from pyspark.ml.feature import StopWordsRemover
stopwordList = ["word1", "word2", "word3"]
remover = StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
The above solution replaces the original list of stop words with the list we supplied.
If you want to add your own stop words in addition to the predefined ones, append the default list to your own list before passing it to StopWordsRemover() as a parameter. Converting the result to a set removes any duplicates.
stopwordList = ["word1", "word2", "word3"]
# Append the default stop word list to the custom one
stopwordList.extend(StopWordsRemover().getStopWords())
stopwordList = list(set(stopwordList))  # optional: remove duplicates
remover = StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
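For the numeric values mentioned in the question, a stop word list can only hold a finite set of literal tokens, so it cannot cover every possible number. One possible workaround is to filter numeric tokens separately, before or after StopWordsRemover. The sketch below is plain Python; the pattern and the function name drop_numeric are assumptions, and "numeric" is taken here to mean digits with optional . or , separators. Applying it to a DataFrame column (for example through a UDF) is not shown.

```python
import re

# Assumed definition of "numeric": digits, optionally grouped with . or ,
# e.g. "42", "3.50", "1,000". Tokens such as "v2" are kept.
NUMERIC = re.compile(r"\d+(?:[.,]\d+)*")


def drop_numeric(tokens):
    """Return the tokens that are not purely numeric."""
    return [t for t in tokens if not NUMERIC.fullmatch(t)]


print(drop_numeric(["order", "66", "costs", "3.50", "usd", "v2"]))
```

Using fullmatch keeps mixed tokens such as "v2" intact; only tokens that are numeric from start to end are dropped.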