
How to speed up spaCy lemmatization?

I'm using spaCy (version 2.0.11) for lemmatization in the first step of my NLP pipeline, but unfortunately it's taking a very long time. It is clearly the slowest part of my processing pipeline and I want to know if there are improvements I could be making. I am calling the pipeline as:

nlp.pipe(docs_generator, batch_size=200, n_threads=6, disable=['ner'])

on an 8-core machine, and I have verified that the machine is using all the cores.

On a corpus of about 3 million short texts totaling almost 2 GB, it takes close to 24 hours to lemmatize and write to disk. Is that reasonable?
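
For context, the surrounding loop looks roughly like this (the output file and the way lemmas are joined are placeholders, not my exact code):

import spacy

nlp = spacy.load('en')
with open('lemmas.txt', 'w') as out:
    for doc in nlp.pipe(docs_generator, batch_size=200, n_threads=6, disable=['ner']):
        # one line of space-separated lemmas per input text
        out.write(' '.join(token.lemma_ for token in doc) + '\n')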

I have tried disabling a couple of parts of the processing pipeline (parser, tagger) and found that doing so broke lemmatization.

Are there any parts of the default processing pipeline that are not required for lemmatization besides named entity recognition?

Are there other ways of speeding up the spaCy lemmatization process?

Aside:

It also appears that the documentation doesn't list all the operations in the processing pipeline. At the top of the spaCy Language class we have:

factories = {
    'tokenizer': lambda nlp: nlp.Defaults.create_tokenizer(nlp),
    'tensorizer': lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
    'tagger': lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
    'parser': lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
    'ner': lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
    'similarity': lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
    'textcat': lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
    'sbd': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'sentencizer': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'merge_noun_chunks': lambda nlp, **cfg: merge_noun_chunks,
    'merge_entities': lambda nlp, **cfg: merge_entities
}

which includes some items not covered in the documentation here: https://spacy.io/usage/processing-pipelines

Since they are not covered, I don't really know which ones may be disabled, nor what their dependencies are.
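
The closest I have come to answering that myself is just inspecting what is actually loaded, e.g. (generic spaCy 2.x introspection, not something from that docs page):

import spacy

nlp = spacy.load('en')
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner'] for the default English model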

asked Jul 17 '18 by TR517


1 Answer

I found out you can disable the parser portion of the spaCy pipeline as well, as long as you add the sentence segmenter. It's not blazing fast, but it is definitely an improvement: in my tests the run time is about 1/3 of what it was before, when I was only disabling 'ner'. Here is what I have now:

nlp = spacy.load('en', disable=['ner', 'parser'])  # keep only the tokenizer and tagger
nlp.add_pipe(nlp.create_pipe('sentencizer'))       # restore sentence boundaries cheaply
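
With that reduced pipeline, lemmas come out of nlp.pipe the same way as before; a minimal sketch (the generator and the lemma extraction are illustrative):

for doc in nlp.pipe(docs_generator, batch_size=200, n_threads=6):
    # only the tagger (plus the cheap sentencizer) runs per doc,
    # which is enough for the English lemma lookup
    lemmas = [token.lemma_ for token in doc]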

answered Oct 03 '22 by TR517