
Spacy NLP library: what is the maximum reasonable document size?

Tags:

python

nlp

spacy

The following question is about the Spacy NLP library for Python, but I would be surprised if the answer for other libraries differed substantially.

What is the maximum document size that Spacy can handle under reasonable memory conditions (e.g. a 4 GB VM in my case)? I had hoped to use Spacy to search for matches in book-size documents (100K+ tokens), but I'm repeatedly getting crashes that point to memory exhaustion as the cause.

I'm an NLP noob - I know the concepts academically, but I don't really know what to expect from state-of-the-art libraries in practice. So I don't know whether what I'm asking the library to do is ridiculously hard, or so easy that it must be something I've screwed up in my environment.

As for why I'm using an NLP library instead of something specifically oriented toward document search (e.g. Solr): I would like to do lemma-based matching rather than string-based matching.
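For context, lemma-based matching of this kind is typically expressed with spaCy's Matcher using token patterns on the LEMMA attribute. A minimal sketch, assuming the v3-style Matcher API and an installed en_core_web_sm model (the phrase being matched is just an example):

    # Minimal sketch of lemma-based matching with spaCy's Matcher
    # (v3-style API; the model name and the phrase are placeholders).
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # Match any inflection of "run" followed by "quickly"
    # by comparing lemmas instead of surface strings.
    matcher.add("RUN_QUICKLY", [[{"LEMMA": "run"}, {"LEMMA": "quickly"}]])

    doc = nlp("She ran quickly across the field, then runs quickly back.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)   # "ran quickly", "runs quickly"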

Asked Jan 08 '18 by Joe Bradley


1 Answer

Spacy has a default max_length limit of 1,000,000 characters. I was able to parse a document of 450,000 words just fine. The limit can be raised, but I would split the text into n chunks depending on the total size.
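A minimal sketch of that chunking approach (the model name and file path are placeholders, splitting on blank lines is an assumption, and matches that span chunk boundaries would be missed):

    # Minimal sketch: process a long text in chunks so that no single Doc
    # exceeds nlp.max_length. Assumes no single paragraph is longer than
    # max_chars; the model name and file path are placeholders.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def chunk_text(text, max_chars=100_000):
        """Greedily pack paragraphs into chunks of at most max_chars."""
        chunks, current = [], ""
        for para in text.split("\n\n"):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = current + "\n\n" + para if current else para
        if current:
            chunks.append(current)
        return chunks

    book_text = open("book.txt", encoding="utf8").read()
    for doc in nlp.pipe(chunk_text(book_text)):
        pass  # run your lemma-based matching on each chunk here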

The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).

https://github.com/explosion/spaCy/blob/master/spacy/errors.py
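And if the parser and NER aren't needed (e.g. for tokenization, tagging and lemma-based matching only), a minimal sketch of raising the limit instead; the model name and file path are again placeholders:

    # Minimal sketch: raise the character limit when the memory-hungry
    # components (parser, NER) are disabled.
    import spacy

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    text = open("book.txt", encoding="utf8").read()
    print(len(text))                # the limit is in characters, so check this first
    nlp.max_length = len(text) + 1  # default is 1,000,000

    doc = nlp(text)                 # tagging/lemmatization still run; parser and NER do not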

Answered Oct 21 '22 by Jeffrey Flynt