Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to un-stem a word in Python?

Tags:

python

nlp

nltk

I want to know if there is anyway that I can un-stem them to a normal form?

The problem is that I have thousands of words in different forms e.g. eat, eaten, ate, eating and so on and I need to count the frequency of each word. All of these - eat, eaten, ate, eating etc will count towards eat and hence, I used stemming.

But the next part of the problem requires me to find similar words in data and I am using nltk's synsets to calculate Wu-Palmer Similarity among the words. The problem is that nltk's synsets wont work on stemmed words, or at least in this code they won't. check if two words are related to each other

How should I do it? Is there a way to un-stem a word?

like image 390
silent_dev Avatar asked May 15 '15 18:05

silent_dev


People also ask

How do you Lemmatize words in Python?

In order to lemmatize, you need to create an instance of the WordNetLemmatizer() and call the lemmatize() function on a single word. Let's lemmatize a simple sentence. We first tokenize the sentence into words using nltk. word_tokenize and then we will call lemmatizer.

Should I stem or Lemmatize?

Stemming is a process that stems or removes last few characters from a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. For instance, stemming the word 'Caring' would return 'Car'.

Does Lemmatization take more storage than stemming?

Stemming can be done very quickly. Lemmatization on the other hand, is the process of converting the given word into it's base form according to the dictionary meaning of the word. Lemmatization takes more time than stemming.


1 Answers

I think an ok approach is something like said in https://stackoverflow.com/a/30670993/7127519.

A possible implementations could be something like this:

import re
import string
import nltk
import pandas as pd
stemmer = nltk.stem.porter.PorterStemmer()

An stemmer to use. Here a text to use:

complete_text = ''' cats catlike catty cat 
stemmer stemming stemmed stem 
fishing fished fisher fish 
argue argued argues arguing argus argu 
argument arguments argument '''

Create a list with the different words:

my_list = []
#for i in complete_text.decode().split():
try: 
    aux = complete_text.decode().split()
except:
    aux = complete_text.split()
for i in aux:
    if i not in my_list:
        my_list.append(i.lower())
my_list

with output:

['cats',
 'catlike',
 'catty',
 'cat',
 'stemmer',
 'stemming',
 'stemmed',
 'stem',
 'fishing',
 'fished',
 'fisher',
 'fish',
 'argue',
 'argued',
 'argues',
 'arguing',
 'argus',
 'argu',
 'argument',
 'arguments']

An now create the dictionary:

aux = pd.DataFrame(my_list, columns =['word'] )
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))
aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))
aux.index = aux['word_stemmed']
del aux['word_stemmed']
my_dict = aux.to_dict('dict')['word']
my_dict

Which output is:

{'argu': 'argue, argued, argues, arguing, argus, argu',
 'argument': 'argument, arguments',
 'cat': 'cats, cat',
 'catlik': 'catlike',
 'catti': 'catty',
 'fish': 'fishing, fished, fish',
 'fisher': 'fisher',
 'stem': 'stemming, stemmed, stem',
 'stemmer': 'stemmer'}

Companion notebook here.

like image 74
Rafael Valero Avatar answered Oct 13 '22 20:10

Rafael Valero