Faster Lemmatization techniques in Python

I am trying to find a faster way to lemmatize the words in a list (named text) using the NLTK WordNetLemmatizer. According to cProfile, this is the most time-consuming step in my whole program.

This is the piece of code that I am trying to optimize for speed:

from nltk.stem import WordNetLemmatizer

def lemmed(text):
    # Lemmatize every word in place and return the same list.
    wnl = WordNetLemmatizer()
    for i, word in enumerate(text):
        text[i] = wnl.lemmatize(word)
    return text

Using the lemmatizer makes my program roughly 20x slower. Any help would be appreciated.
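For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of profiling a pipeline with the standard-library cProfile and pstats modules. The names slow_step, pipeline and profile_pipeline are hypothetical, and slow_step is only a toy stand-in for the lemmatization call so that the snippet runs without NLTK.

```python
import cProfile
import io
import pstats

def slow_step(words):
    # Hypothetical stand-in for the lemmatization step (not real lemmatization).
    return [w.rstrip("s") for w in words]

def pipeline(words):
    # Hypothetical program whose hot spot we want to locate.
    return slow_step(words)

def profile_pipeline(words):
    # Profile one run of the pipeline and return its result plus a text report.
    profiler = cProfile.Profile()
    profiler.enable()
    result = pipeline(words)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # five most expensive entries
    return result, buffer.getvalue()

result, report = profile_pipeline(["tests", "friends", "hello"] * 1000)
print(report)
```

In the report, the function with the largest cumulative time is the one worth optimizing first.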


asked Jun 24 '16 18:06

Shivansh Singh

1 Answer

If you have a few cores to spare, try using the multiprocessing library:

from nltk import WordNetLemmatizer
from multiprocessing import Pool

def lemmed(text, cores=6): # tweak cores as needed
    with Pool(processes=cores) as pool:
        wnl = WordNetLemmatizer()
        result = pool.map(wnl.lemmatize, text)
    return result


if __name__ == '__main__':  # guard needed where worker processes are spawned (e.g. Windows, macOS)
    sample_text = ['tests', 'friends', 'hello'] * (10 ** 6)

    lemmed_text = lemmed(sample_text)

    assert len(sample_text) == len(lemmed_text) == (10 ** 6) * 3

    print(lemmed_text[:3])
    # => ['test', 'friend', 'hello']
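If the list contains many repeated words, as the sample above does, caching may help as well, on its own or together with the pool. This is a different technique from multiprocessing, and only a sketch: lemmed_cached and toy_lemmatize are hypothetical names, and toy_lemmatize is a stand-in so the snippet runs without NLTK. In real use you would presumably pass WordNetLemmatizer().lemmatize as the lemmatize argument.

```python
from functools import lru_cache

def lemmed_cached(text, lemmatize, maxsize=50000):
    """Lemmatize every word in `text`, computing each distinct word only once."""
    # Wrap the callable in an LRU cache so repeated words are looked up, not recomputed.
    cached = lru_cache(maxsize=maxsize)(lemmatize)
    return [cached(word) for word in text]

calls = []

def toy_lemmatize(word):
    # Hypothetical stand-in for WordNetLemmatizer().lemmatize; records each real call.
    calls.append(word)
    return word[:-1] if word.endswith("s") else word

sample = ['tests', 'friends', 'hello'] * 1000
out = lemmed_cached(sample, toy_lemmatize)
print(out[:3], len(calls))
# => ['test', 'friend', 'hello'] 3
```

The underlying function runs only three times for 3000 input words. How much this gains on real text depends on how often words repeat, so measure before and after.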

answered Sep 22 '22 07:09

Alec

Tags: performance, python, python-3.x, nltk, lemmatization
