I'm using spaCy (version 2.0.11) for lemmatization in the first step of my NLP pipeline, but unfortunately it's taking a very long time. It is clearly the slowest part of my processing pipeline, and I want to know whether there are improvements I could be making. I am calling the pipeline as:
nlp.pipe(docs_generator, batch_size=200, n_threads=6, disable=['ner'])
on an 8-core machine, and I have verified that the machine is using all the cores.
On a corpus of about 3 million short texts totaling almost 2 GB, it takes close to 24 hours to lemmatize and write to disk. Is that reasonable?
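For context, here is a minimal sketch of the full loop (the file names and the one-text-per-line input format are illustrative assumptions; my real docs_generator is more involved):

import spacy

nlp = spacy.load('en')

def docs_generator():
    # hypothetical input: one short text per line
    with open('texts.txt') as f:
        for line in f:
            yield line.strip()

with open('lemmas.txt', 'w') as out:
    for doc in nlp.pipe(docs_generator(), batch_size=200, n_threads=6, disable=['ner']):
        out.write(' '.join(token.lemma_ for token in doc) + '\n')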
I have tried disabling other parts of the processing pipeline (the parser and the tagger) and found that doing so broke the lemmatization.
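Roughly what I tried, for reference (reusing docs_generator from the sketch above):

for doc in nlp.pipe(docs_generator(), batch_size=200, n_threads=6, disable=['ner', 'parser', 'tagger']):
    ...  # this broke lemmatization for me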
Are there any parts of the default processing pipeline that are not required for lemmatization besides named entity recognition?
Are there other ways of speeding up the spaCy lemmatization process?
Aside:
It also appears that the documentation doesn't list all the operations in the processing pipeline. At the top of the spaCy Language class we have:
factories = {
    'tokenizer': lambda nlp: nlp.Defaults.create_tokenizer(nlp),
    'tensorizer': lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
    'tagger': lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
    'parser': lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
    'ner': lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
    'similarity': lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
    'textcat': lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
    'sbd': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'sentencizer': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'merge_noun_chunks': lambda nlp, **cfg: merge_noun_chunks,
    'merge_entities': lambda nlp, **cfg: merge_entities
}
which includes some items not covered in the documentation here: https://spacy.io/usage/processing-pipelines
Since they are not covered there, I don't really know which of them may be disabled, nor what their dependencies are.
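One way to at least see which components a loaded model actually runs (a quick check; the output depends on the model):

import spacy

nlp = spacy.load('en')
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner'] for the default English model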
I found out that you can disable the parser portion of the spaCy pipeline as well, as long as you add the sentence segmenter. It's not blazing fast, but it is definitely an improvement: in my tests the run time is about 1/3 of what it was before (when I was only disabling 'ner'). Here is what I have now:
nlp = spacy.load('en', disable=['ner', 'parser'])  # keep the tagger: the English lemmatizer needs POS tags
nlp.add_pipe(nlp.create_pipe('sentencizer'))       # restore sentence boundaries without the parser
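A quick sanity check that lemmas still come out with the reduced pipeline (the exact lemmas depend on the model; '-PRON-' is spaCy 2.x's placeholder lemma for pronouns):

doc = nlp(u'The bats were hanging on their feet')
print([token.lemma_ for token in doc])
# something like: ['the', 'bat', 'be', 'hang', 'on', '-PRON-', 'foot']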