Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster Lemmatization techniques in Python

I am trying to find out a faster way to lemmatize words in a list (named text) using the NLTK Word Net Lemmatizer. Apparently this is the most time consuming step in my whole program(used cProfiler to find the same).

Following is the piece of code that I am trying to optimize for speed -

def lemmed(text):
    l = len(text)
    i = 0
    wnl = WordNetLemmatizer()
    while (i<l):
        text[i] = wnl.lemmatize(text[i])
        i = i + 1
    return text

Using the lemmatizer decreases my performance by 20x. Any help would be appreciated.

like image 580
Shivansh Singh Avatar asked Jun 24 '16 18:06

Shivansh Singh


People also ask

Which lemmatizer is best?

Wordnet Lemmatizer Wordnet is a publicly available lexical database of over 200 languages that provides semantic relationships between its words. It is one of the earliest and most commonly used lemmatizer technique.

What is Python Lemmatization?

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meanings to one word.

Does Lemmatization take more storage than stemming?

Stemming can be done very quickly. Lemmatization on the other hand, is the process of converting the given word into it's base form according to the dictionary meaning of the word. Lemmatization takes more time than stemming.

What is Lemmatization in NLP example?

In Lemmatization root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words. For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all these words.


1 Answers

If you have a few cores to spare, try using the multiprocessing library:

from nltk import WordNetLemmatizer
from multiprocessing import Pool

def lemmed(text, cores=6): # tweak cores as needed
    with Pool(processes=cores) as pool:
        wnl = WordNetLemmatizer()
        result = pool.map(wnl.lemmatize, text)
    return result


sample_text = ['tests', 'friends', 'hello'] * (10 ** 6)

lemmed_text = lemmed(sample_text)

assert len(sample_text) == len(lemmed_text) == (10 ** 6) * 3

print(lemmed_text[:3])
# => ['test', 'friend', 'hello']
like image 78
Alec Avatar answered Sep 22 '22 07:09

Alec