The following question is about the spaCy NLP library for Python, but I would be surprised if the answer for other libraries differed substantially.
What is the maximum document size that spaCy can handle under reasonable memory conditions (e.g. a 4 GB VM in my case)? I had hoped to use spaCy to search for matches in book-size documents (100K+ tokens), but I'm repeatedly getting crashes that point to memory exhaustion as the cause.
I'm an NLP noob - I know the concepts academically, but I don't really know what to expect from the state-of-the-art libraries in practice. So I don't know whether what I'm asking the library to do is ridiculously hard, or so easy that it must be something I've screwed up in my environment.
As for why I'm using an NLP library instead of something specifically oriented toward document search (e.g. Solr): I would like to do lemma-based matching rather than string-based matching.
While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.
spaCy is a free, open-source library for NLP in Python.
A Doc is a sequence of Token objects. You can access sentences and named entities, export annotations to numpy arrays, and losslessly serialize to compressed binary strings. Internally, the Doc object holds an array of TokenC structs; the Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves.
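A minimal sketch of that Doc/Token relationship, using a blank English pipeline so no model download is needed (a blank pipeline only tokenizes; for lemmas you'd load a trained model):

```python
import spacy

# Blank pipeline: tokenizer only, no statistical models required.
nlp = spacy.blank("en")
doc = nlp("Dogs are barking loudly.")

# Token objects are lightweight views into the Doc's underlying array.
print([token.text for token in doc])
# → ['Dogs', 'are', 'barking', 'loudly', '.']
```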
spaCy has a default `max_length` limit of 1,000,000 characters. I was able to parse a document with 450,000 words just fine. The limit can be raised, or you can split the text into n chunks depending on total size.
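A sketch of one way to do that chunking, splitting on whitespace so no word is cut in half. The `chunk_text` helper and its `max_chars` default are my own illustration, not part of spaCy's API:

```python
def chunk_text(text, max_chars=1_000_000):
    """Split text into chunks below spaCy's default max_length,
    breaking at whitespace so words stay intact."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space inside the window, if any,
            # so we never split in the middle of a word.
            split = text.rfind(" ", start, end)
            if split > start:
                end = split
        chunks.append(text[start:end])
        start = end
    return chunks
```

Each chunk can then be passed to `nlp()` separately (or via `nlp.pipe()`), at the cost of losing sentence context across chunk boundaries.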
The v2.x parser and NER models require roughly 1 GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
https://github.com/explosion/spaCy/blob/master/spacy/errors.py
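Putting that advice together, here is one possible setup, assuming you only need tokenization and can drop the parser and NER (a blank pipeline is used here to keep the example self-contained; for lemma-based matching you'd load a trained model with those components disabled):

```python
import spacy

# Blank English pipeline: tokenizer only, so the ~1 GB per 100k characters
# cost of the parser/NER never applies. For lemmas, you could instead do
# spacy.load("en_core_web_sm", disable=["parser", "ner"]).
nlp = spacy.blank("en")
nlp.max_length = 2_000_000  # raise the default 1,000,000-character limit

long_text = "word " * 300_000  # ~1.5M characters, over the default limit
doc = nlp(long_text)
print(len(doc))  # number of tokens parsed without hitting E088
```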