
How to use spaCy's lemmatizer to get a word into its basic form

I am new to spaCy and want to use its lemmatizer, but I don't know how. I would like to pass in a string of words and get back the same string with each word in its basic form. For example:


  • 'words'=> 'word'
  • 'did' => 'do'

Thank you.

yi wang asked Aug 04 '16 09:08

People also ask

How do you tokenize words in spaCy?

In spaCy, tokenizing a text into words and punctuation happens in several steps, and the text is processed from left to right. First, the tokenizer splits the text on whitespace, similar to the split() function. Then it checks whether each substring matches a tokenizer exception rule.
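The two steps described above can be sketched in plain Python. This is a toy illustration, not spaCy's tokenizer: the `TOKENIZER_EXCEPTIONS` table and the `toy_tokenize` function are hypothetical names, and the real tokenizer also applies prefix, suffix and infix rules that are only crudely approximated here by peeling off trailing punctuation.

```python
import string

# Hypothetical exception table: substrings that must be split in a special way.
TOKENIZER_EXCEPTIONS = {
    "aren't": ["are", "n't"],
    "don't": ["do", "n't"],
}


def toy_tokenize(text):
    """Split on whitespace, then apply exception rules to each substring."""
    tokens = []
    # Step 1: split on whitespace, left to right.
    for chunk in text.split():
        # Peel trailing punctuation off the chunk (a crude stand-in for suffix rules),
        # unless the chunk as a whole is a known exception.
        trailing = []
        while (
            chunk
            and chunk not in TOKENIZER_EXCEPTIONS
            and chunk[-1] in string.punctuation
        ):
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Step 2: check the remaining substring against the exception table.
        if chunk in TOKENIZER_EXCEPTIONS:
            tokens.extend(TOKENIZER_EXCEPTIONS[chunk])
        elif chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(toy_tokenize("Boots and hippos aren't."))
```

The exception table is what lets "aren't" come out as two tokens instead of one, which a plain split() cannot do.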

How do you use the WordNet lemmatizer?

To lemmatize, create an instance of WordNetLemmatizer() and call its lemmatize() function on a single word. To lemmatize a simple sentence, first tokenize it into words using nltk.word_tokenize, then call lemmatize() on each word.

Does spaCy have Stemming?

It might be surprising to you but spaCy doesn't contain any function for stemming as it relies on lemmatization only.

3 Answers

The previous answer is convoluted and can't be edited, so here's a more conventional one.

# Make sure you have downloaded the English model with "python -m spacy download en".
# The 'en' shortcut works in spaCy 1.x and 2.x; spaCy 3 uses full model names such as 'en_core_web_sm'.
import spacy

nlp = spacy.load('en')

doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

for token in doc:
    # token.lemma is the integer ID of the lemma, token.lemma_ is its string form
    print(token, token.lemma, token.lemma_)

Output:

Apples 6617 apples
and 512 and
oranges 7024 orange
are 536 be
similar 1447 similar
. 453 .
Boots 4622 boot
and 512 and
hippos 98365 hippo
are 536 be
n't 538 not
. 453 .

From the official Lightning tour.
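To get back a whole string rather than printing token by token, which is what the question asks for, the lemmas can be joined together again. The following is a minimal sketch: `lemmatize_text` and `toy_nlp` are hypothetical names, and `toy_nlp` is only a stand-in that knows the two words from the question, so the example runs without a model installed. It relies on `token.lemma_` and on `token.whitespace_` (the trailing whitespace of a token) to keep the original spacing.

```python
from types import SimpleNamespace


def lemmatize_text(nlp, text):
    """Return `text` with each token replaced by its lemma.

    `nlp` is any callable returning an iterable of tokens that expose
    `lemma_` and `whitespace_`, as the tokens of a spaCy Doc do.
    """
    doc = nlp(text)
    # Re-attach each token's trailing whitespace so the spacing is preserved.
    return "".join(token.lemma_ + token.whitespace_ for token in doc)


# Hypothetical stand-in for a loaded spaCy pipeline (an assumption for this sketch).
_TOY_LEMMAS = {"words": "word", "did": "do"}


def toy_nlp(text):
    words = text.split(" ")
    return [
        SimpleNamespace(
            lemma_=_TOY_LEMMAS.get(word, word),
            whitespace_=" " if i < len(words) - 1 else "",
        )
        for i, word in enumerate(words)
    ]


print(lemmatize_text(toy_nlp, "words did"))  # word do
```

With spaCy and a model installed, the `nlp` object loaded in the answer above can be passed in place of `toy_nlp`.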

damio answered Sep 23 '22 03:09


If you want to use just the Lemmatizer, you can do that in the following way:

from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
lemmas = lemmatizer(u'ducks', u'NOUN')
print(lemmas)




Since spacy version 2.2, LEMMA_INDEX, LEMMA_EXC, and LEMMA_RULES have been bundled into a Lookups Object:

import spacy
nlp = spacy.load('en')

nlp.vocab.lookups
>>> <spacy.lookups.Lookups object at 0x7f89a59ea810>
nlp.vocab.lookups.tables
>>> ['lemma_lookup', 'lemma_rules', 'lemma_index', 'lemma_exc']

You can still use the lemmatizer directly with a word and a POS (part of speech) tag:

from spacy.lemmatizer import Lemmatizer, ADJ, NOUN, VERB

lemmatizer = nlp.vocab.morphology.lemmatizer
lemmatizer('ducks', NOUN)
>>> ['duck']

You can pass the POS tag as the imported constant like above or as string:

lemmatizer('ducks', 'NOUN')
>>> ['duck']
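To give a rough idea of how an index, an exception table and a set of suffix rules can work together for a given POS tag, here is a toy sketch in plain Python. It is not spaCy's implementation: the table contents are invented for illustration, `toy_lemmatize` is a hypothetical name, and the real lemmatizer handles unknown words and more cases than this.

```python
# Toy tables, invented for this example; the real tables are far larger.
LEMMA_INDEX = {"noun": {"duck", "box"}, "verb": {"walk"}}
LEMMA_EXC = {"noun": {"mice": ["mouse"]}, "verb": {"did": ["do"]}}
LEMMA_RULES = {
    "noun": [("es", ""), ("s", "")],
    "verb": [("ing", ""), ("ed", ""), ("s", "")],
}


def toy_lemmatize(word, pos):
    """Return a list of candidate lemmas for `word` with part of speech `pos`."""
    word = word.lower()
    pos = pos.lower()
    index = LEMMA_INDEX.get(pos, set())
    exceptions = LEMMA_EXC.get(pos, {})
    rules = LEMMA_RULES.get(pos, [])

    # 1. Irregular forms are looked up directly.
    if word in exceptions:
        return list(exceptions[word])

    # 2. Apply each suffix rule and keep results that are known lemmas.
    candidates = []
    for old, new in rules:
        if word.endswith(old):
            form = word[: len(word) - len(old)] + new
            if form in index and form not in candidates:
                candidates.append(form)

    # 3. If nothing matched, fall back to the word itself.
    return candidates or [word]


print(toy_lemmatize("ducks", "NOUN"))  # ['duck']
print(toy_lemmatize("did", "VERB"))    # ['do']
```

The POS tag matters because the exceptions and rules are keyed by it: in this toy version "did" is only mapped to "do" when it is looked up as a verb.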


joel answered Sep 23 '22 03:09


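Code:

# Note: spacy.en and LOCAL_DATA_DIR belong to older spaCy releases (before 2.0).
import os
from spacy.en import English, LOCAL_DATA_DIR

# Use the SPACY_DATA environment variable if set, otherwise the default data directory
data_dir = os.environ.get('SPACY_DATA', LOCAL_DATA_DIR)

nlp = English(data_dir=data_dir)

doc3 = nlp(u"this is spacy lemmatize testing. programming books are more better than others")

for token in doc3:
    # token.lemma is the integer ID of the lemma, token.lemma_ is its string form
    print(token, token.lemma, token.lemma_)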

Output:

this 496 this
is 488 be
spacy 173779 spacy
lemmatize 1510965 lemmatize
testing 2900 testing
. 419 .
programming 3408 programming
books 1011 book
are 488 be
more 529 more
better 615 better
than 555 than
others 871 others

Example Ref: here

RAVI answered Sep 23 '22 03:09