
Spacy NLP library: what is the maximum reasonable document size?

Tags:

python

nlp

spacy

The following question is about the Spacy NLP library for Python, but I would be surprised if the answer for other libraries differed substantially.

What is the maximum document size that Spacy can handle under reasonable memory conditions (e.g. a 4 GB VM in my case)? I had hoped to use Spacy to search for matches in book-size documents (100K+ tokens), but I'm repeatedly getting crashes that point to memory exhaustion as the cause.

I'm an NLP noob - I know the concepts academically, but I don't really know what to expect from state-of-the-art libraries in practice. So I don't know whether what I'm asking the library to do is ridiculously hard, or so easy that it must be something I've screwed up in my environment.

As for why I'm using an NLP library instead of something specifically oriented toward document search (e.g. Solr): I would like to do lemma-based matching rather than string-based matching.
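For context, lemma-based matching of this kind is typically expressed with spaCy's Matcher using token patterns on the LEMMA attribute. A minimal sketch, assuming the v3-style Matcher API and an installed en_core_web_sm model (the phrase being matched is just an example):

    # Minimal sketch of lemma-based matching with spaCy's Matcher
    # (v3-style API; the model name and the phrase are placeholders).
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # Match any inflection of "run" followed by "quickly"
    # by comparing lemmas instead of surface strings.
    matcher.add("RUN_QUICKLY", [[{"LEMMA": "run"}, {"LEMMA": "quickly"}]])

    doc = nlp("She ran quickly across the field, then runs quickly back.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)   # "ran quickly", "runs quickly"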

Asked Jan 08 '18 by Joe Bradley


1 Answer

Spacy has a default max_length limit of 1,000,000 characters. I was able to parse a document of 450,000 words just fine. The limit can be raised, but I would split the text into n chunks depending on the total size.
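A minimal sketch of that chunking approach (the model name and file path are placeholders, splitting on blank lines is an assumption, and matches that span chunk boundaries would be missed):

    # Minimal sketch: process a long text in chunks so that no single Doc
    # exceeds nlp.max_length. Assumes no single paragraph is longer than
    # max_chars; the model name and file path are placeholders.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def chunk_text(text, max_chars=100_000):
        """Greedily pack paragraphs into chunks of at most max_chars."""
        chunks, current = [], ""
        for para in text.split("\n\n"):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = current + "\n\n" + para if current else para
        if current:
            chunks.append(current)
        return chunks

    book_text = open("book.txt", encoding="utf8").read()
    for doc in nlp.pipe(chunk_text(book_text)):
        pass  # run your lemma-based matching on each chunk here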

The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).

https://github.com/explosion/spaCy/blob/master/spacy/errors.py
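And if the parser and NER aren't needed (e.g. for tokenization, tagging and lemma-based matching only), a minimal sketch of raising the limit instead; the model name and file path are again placeholders:

    # Minimal sketch: raise the character limit when the memory-hungry
    # components (parser, NER) are disabled.
    import spacy

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    text = open("book.txt", encoding="utf8").read()
    print(len(text))                # the limit is in characters, so check this first
    nlp.max_length = len(text) + 1  # default is 1,000,000

    doc = nlp(text)                 # tagging/lemmatization still run; parser and NER do not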

Answered Oct 21 '22 by Jeffrey Flynt