 

Lemmatize a doc with spacy?

I have a spaCy doc that I would like to lemmatize.

For example:

import spacy
nlp = spacy.load('en_core_web_lg')

my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)

How can I convert every token in the doc to its lemma?

asked Aug 02 '18 16:08 by max


People also ask

What is Doc in spaCy?

A Doc is a sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings. The Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves.

What is the difference between stemming and Lemmatization?

Stemming strips the last few characters from a word by rule, which often produces misspelled forms or forms with the wrong meaning. Lemmatization takes the context into account and converts the word to its meaningful base form, called the lemma.


1 Answer

Each token has a number of attributes, and you can iterate over the doc to access them. The lemma of a token is available as a string in token.lemma_.

For example: [token.lemma_ for token in doc]

If you want to reconstruct the sentence you could use: ' '.join([token.lemma_ for token in doc])
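Note that joining with a single space puts a space before punctuation and drops the original spacing. As a minimal sketch, the two one-liners above can be wrapped in small helpers; the function names lemmas and lemmatized_text are hypothetical (not part of spaCy), and the second one assumes you want to reuse each token's original trailing whitespace via the token.whitespace_ attribute:

```python
from typing import Iterable, List


def lemmas(doc: Iterable) -> List[str]:
    """Return the lemma string of every token in a spaCy Doc."""
    return [token.lemma_ for token in doc]


def lemmatized_text(doc: Iterable) -> str:
    """Rebuild the text from lemmas, keeping each token's trailing whitespace.

    Unlike ' '.join(...), this does not insert a space before punctuation,
    because token.whitespace_ is empty when no space followed the token.
    """
    return "".join(token.lemma_ + token.whitespace_ for token in doc)


# Usage with the doc from the question (needs spaCy and the model installed):
#   doc = nlp(my_str)
#   print(lemmas(doc))
#   print(lemmatized_text(doc))
```

The exact lemmas you get depend on the model and spaCy version, so check the output on your own text.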

For a full list of token attributes see: https://spacy.io/api/token#attributes

answered Sep 21 '22 07:09 by ame