In NLP, stop-word removal is a typical pre-processing step, and it is usually done empirically, based on a list of what we think stop words should be.
In my opinion, however, we should generalize the concept: stop words can vary across corpora from different domains. I am wondering whether we can define stop words mathematically, for example by their statistical characteristics, and then automatically extract them from a corpus for a specific domain.
Has there been any work or progress along these lines? Could anyone shed some light?
I am not an expert, but I hope my answer makes sense.
Statistically extracting stop words from a corpus sounds interesting! Apart from using a common stop-word list, like the one in NLTK, I would consider calculating inverse document frequency (IDF), as mentioned in the other answers. Stop words not only vary from corpus to corpus; they may also vary from problem to problem. For example, in one problem I was working on, I used a corpus of news articles, which contains many time-sensitive and location-sensitive words. Those words carried crucial information, and statistically removing words like "today", "here", etc. would have hurt my results badly, because news articles discuss not just one particular event but also similar events that happened in the past or in other locations.
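To make the IDF idea concrete, here is a minimal sketch in Python (the function name and the `idf_threshold` default are hypothetical choices, not a standard): terms that appear in nearly every document get a low IDF and become stop-word candidates for that corpus.

```python
import math
from collections import Counter

def idf_stopword_candidates(documents, idf_threshold=1.0):
    """Flag terms whose IDF falls below a threshold.

    Terms appearing in almost every document get a low IDF and are
    likely stop words for this corpus. The default threshold is
    arbitrary; in practice, inspect the IDF distribution first.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    # IDF = log(N / df); low values mean the term occurs everywhere.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return sorted((t for t, v in idf.items() if v < idf_threshold),
                  key=lambda t: idf[t])

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "the sun is bright today"]
print(idf_stopword_candidates(docs))  # -> ['the', 'cat']
```

Notice that on such a tiny corpus even "cat" crosses the threshold, which is exactly why the cutoff has to be tuned to the corpus and the problem at hand.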
My point, in short, is that you would need to consider the problem being addressed as well, and not just the corpus.
Thanks, Ramya
Usually stop words occur much more frequently than other, semantically meaningful words, so when building my application I used a combination of both approaches: a fixed list and a statistical method. I was using NLTK, which already ships with a list of common stop words, so I first removed the words appearing in that list. Of course, this did not remove all the stop words, since, as you mentioned, stop words differ from corpus to corpus. I then computed the frequency of each word in the corpus and removed the words whose frequency was above a certain limit. That limit was a value I fixed after observing the frequencies of all the words, so it too depends on the corpus, but you can choose it easily once you carefully inspect the list of words sorted by frequency. This statistical step ensures that you also remove stop words that do not appear in the common stop-word list. Finally, to refine the data, I applied POS tagging and removed the proper nouns that still remained after the first two steps.
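For concreteness, a rough sketch of that three-pass pipeline could look like this (the function name and the `freq_limit` default are hypothetical choices; in practice you would pick the cutoff by inspecting the frequency-sorted vocabulary, and the POS tagger works best on intact sentences rather than a flattened token list):

```python
import nltk
from collections import Counter
from nltk.corpus import stopwords

# One-time downloads (uncomment on first run):
# nltk.download('stopwords'); nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

def clean_tokens(texts, freq_limit=0.01):
    """Three-pass cleanup: NLTK stop list, frequency cutoff, POS filter.

    freq_limit is a made-up relative-frequency cutoff; the answer
    above picks it by eyeballing the frequency-sorted word list.
    """
    common = set(stopwords.words('english'))
    tokens = [t for text in texts for t in nltk.word_tokenize(text)]

    # Pass 1: drop words from the fixed common stop-word list.
    tokens = [t for t in tokens if t.lower() not in common]

    # Pass 2: drop words whose relative frequency exceeds the limit.
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    tokens = [t for t in tokens if counts[t.lower()] / total <= freq_limit]

    # Pass 3: POS-tag and drop proper nouns (tags NNP / NNPS).
    return [t for t, tag in nltk.pos_tag(tokens) if tag not in ('NNP', 'NNPS')]
```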