I'm using spaCy (version 2.0.11) for lemmatization in the first step of my NLP pipeline, but unfortunately it's taking a very long time. It is clearly the slowest part of my processing pipeline, and I want to know whether there are improvements I could be making. I am calling the pipeline as:
nlp.pipe(docs_generator, batch_size=200, n_threads=6, disable=['ner'])
on an 8-core machine, and I have verified that the machine is using all the cores.
On a corpus of about 3 million short texts totaling almost 2 GB, it takes close to 24 hours to lemmatize and write to disk. Is that reasonable?
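For context, here is a minimal sketch of the full loop (the file names and the one-text-per-line input format are illustrative assumptions; my real docs_generator is more involved):

import spacy

nlp = spacy.load('en')

def docs_generator():
    # hypothetical input: one short text per line
    with open('texts.txt') as f:
        for line in f:
            yield line.strip()

with open('lemmas.txt', 'w') as out:
    for doc in nlp.pipe(docs_generator(), batch_size=200, n_threads=6, disable=['ner']):
        out.write(' '.join(token.lemma_ for token in doc) + '\n')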
I have tried disabling other parts of the processing pipeline (the parser and the tagger) and found that doing so broke the lemmatization.
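Roughly what I tried, for reference (reusing docs_generator from the sketch above):

for doc in nlp.pipe(docs_generator(), batch_size=200, n_threads=6, disable=['ner', 'parser', 'tagger']):
    ...  # this broke lemmatization for me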
Are there any parts of the default processing pipeline that are not required for lemmatization besides named entity recognition?
Are there other ways of speeding up the spaCy lemmatization process?
Aside:
It also appears that the documentation doesn't list all the operations in the processing pipeline. At the top of the spaCy Language class we have:
factories = {
    'tokenizer': lambda nlp: nlp.Defaults.create_tokenizer(nlp),
    'tensorizer': lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
    'tagger': lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
    'parser': lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
    'ner': lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
    'similarity': lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
    'textcat': lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
    'sbd': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'sentencizer': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'merge_noun_chunks': lambda nlp, **cfg: merge_noun_chunks,
    'merge_entities': lambda nlp, **cfg: merge_entities
}
which includes some items not covered in the documentation here: https://spacy.io/usage/processing-pipelines
Since they are not covered there, I don't really know which of them may be disabled, nor what their dependencies are.
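One way to at least see which components a loaded model actually runs (a quick check; the output depends on the model):

import spacy

nlp = spacy.load('en')
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner'] for the default English model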
I found out that you can disable the parser portion of the spaCy pipeline as well, as long as you add the sentence segmenter. It's not blazing fast, but it is definitely an improvement: in my tests the run time is about 1/3 of what it was before (when I was only disabling 'ner'). Here is what I have now:
nlp = spacy.load('en', disable=['ner', 'parser'])  # keep the tagger: the English lemmatizer needs POS tags
nlp.add_pipe(nlp.create_pipe('sentencizer'))       # restore sentence boundaries without the parser
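A quick sanity check that lemmas still come out with the reduced pipeline (the exact lemmas depend on the model; '-PRON-' is spaCy 2.x's placeholder lemma for pronouns):

doc = nlp(u'The bats were hanging on their feet')
print([token.lemma_ for token in doc])
# something like: ['the', 'bat', 'be', 'hang', 'on', '-PRON-', 'foot']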