How to add custom stop word list to StopWordsRemover

I am using the pyspark.ml.feature.StopWordsRemover class on my PySpark DataFrame, which has an ID column and a Text column. In addition to the default stop word list, I would like to add my own custom list so that numeric values are removed from the strings as well.

I can see that the class provides a setStopWords method, but I'm struggling with the proper syntax for using it.

from pyspark.ml.feature import StopWordsRemover

# Remove the default stop words from the tokenized "words" column
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
filtered_df = remover.transform(df)

The code above gives the expected results in the filtered column, but it only removes the standard stop words. I'm looking for a way to add my own custom list, containing additional words and the numeric values that I wish to filter out.

asked Apr 26 '17 01:04

user2763088

1 Answer

You can pass your list through the stopWords parameter:

from pyspark.ml.feature import StopWordsRemover

stopwordList = ["word1", "word2", "word3"]

StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
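
To make the effect of the stopWords parameter concrete, here is a rough pure-Python sketch of the filtering it performs, assuming the default caseSensitive=False. The function name remove_stop_words is hypothetical and only mimics the behaviour for illustration; it is not Spark's implementation.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Illustrative stand-in (hypothetical helper, not part of pyspark)."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Case-insensitive mode: compare lower-cased tokens to lower-cased stop words.
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


tokens = ["The", "price", "is", "123", "and", "456"]
custom_only = ["word1", "123"]

# Only tokens equal to a listed stop word are dropped.
print(remove_stop_words(tokens, custom_only))
```

Matching is done on whole tokens, so listing "123" removes only that exact token and leaves "456" in place. Removing arbitrary numbers would need a separate pattern-based filter rather than a longer stop word list.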

A small note:

The solution above replaces the original list of stop words with the list supplied.
If you want to add your own stop words in addition to the predefined ones, extend your list with the original list before passing it to StopWordsRemover() as a parameter. Converting the result to a set removes any duplicates.

stopwordList = ["word1", "word2", "word3"]
# Add the default stop words to the custom list
stopwordList.extend(StopWordsRemover().getStopWords())
stopwordList = list(set(stopwordList))  # optional: drop duplicates
StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
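
As a variation on the snippet above, the merge can be done in a way that keeps the word order stable, since list(set(...)) returns the words in arbitrary order. This is a minimal sketch: merge_stop_words and build_remover are hypothetical helper names, and build_remover assumes an active SparkSession, because the default list is loaded through the JVM via StopWordsRemover.loadDefaultStopWords.

```python
def merge_stop_words(default_words, custom_words):
    """Combine two stop word lists, dropping duplicates but keeping order."""
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return list(dict.fromkeys(list(default_words) + list(custom_words)))


def build_remover(custom_words, input_col="words", output_col="filtered"):
    """Hypothetical helper: a remover using the default plus custom words."""
    # Imported lazily so merge_stop_words works without a Spark installation.
    from pyspark.ml.feature import StopWordsRemover

    # Assumes an active SparkSession; "english" is the default language.
    default_words = StopWordsRemover.loadDefaultStopWords("english")
    return StopWordsRemover(
        inputCol=input_col,
        outputCol=output_col,
        stopWords=merge_stop_words(default_words, custom_words),
    )


merged = merge_stop_words(["a", "the"], ["the", "word1", "123"])
print(merged)
```

With a Spark session running, build_remover(["word1", "123"]).transform(df) would then be used in the same way as the remover in the question.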

answered Oct 21 '22 01:10

ML_TN

Tags: python, text-mining, pyspark, spark-dataframe, stop-words