I'm trying to create a corpus of words from a text. I use spaCy. Here is my code:
import spacy
nlp = spacy.load('fr_core_news_md')
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)
f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()
But it returns this exception:
ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
I tried something like this:
import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1027203
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)
f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()
But got the same error:
ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
How to fix it?
I differ from the answer above: I think nlp.max_length did execute correctly, but the value you set is too low. It looks like you set it to exactly the length reported in the error message. Increase nlp.max_length to a little over that number:
nlp.max_length = 1030000 # or even higher
It should ideally work after this.
So your code could be changed to this:
import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1030000 # or higher
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)
f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()
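The error message itself also hints that the limit mainly protects the parser and NER. Since you only need lemmas, a minimal sketch of that route (assuming the lemmatizer in fr_core_news_md does not need the disabled components; the set-based deduplication is my own addition) would be:

import spacy

# Disable the memory-heavy pipes the error message warns about;
# only the lemmas are needed here.
nlp = spacy.load('fr_core_news_md', disable=['parser', 'ner'])
nlp.max_length = 1030000  # or higher

with open("text.txt") as f:
    doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))

# A set avoids the quadratic `not in words` membership checks on a list.
words = sorted({token.lemma_ for token in doc})

with open("corpus.txt", 'w') as f:
    f.write("Number of words:" + str(len(words)) + "\n" + ''.join(w + "\n" for w in words))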
I faced the same issue. I had to loop over a directory of text files and perform NER on them to extract the entities they contain.
for file in folder_text_files:
    with open(file, 'r', errors="ignore") as f:
        text = f.read()
    # raise the limit before calling nlp() on this file's text
    nlp.max_length = len(text) + 100
So doing this might save you from worrying about the text size.
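For completeness, a hedged sketch of the full loop. folder_text_files, the "texts/*.txt" path, and the entity handling at the end are placeholders I made up; adjust them to your setup:

import glob
import spacy

nlp = spacy.load('fr_core_news_md')
folder_text_files = glob.glob("texts/*.txt")  # hypothetical path

for file in folder_text_files:
    with open(file, 'r', errors="ignore") as f:
        text = f.read()
    nlp.max_length = len(text) + 100  # keep the limit above the current text length
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    print(file, entities)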
It looks like the nlp.max_length = 1027203 line in your second example did not execute correctly.
Alternatively, if your text file has multiple lines, you can create a doc for each line of the file. Something like the following:
for line in f.read().split('\n'):
    doc = nlp(''.join(ch for ch in line if ch.isalnum() or ch == " "))
    ...
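A minimal sketch of that per-line approach applied to the original lemma-collection task (file names are taken from the question; accumulating into a set is my own choice):

import spacy

nlp = spacy.load('fr_core_news_md')
lemmas = set()

with open("text.txt") as f:
    for line in f:  # one line at a time, so no single text exceeds max_length
        doc = nlp(''.join(ch for ch in line if ch.isalnum() or ch == " "))
        lemmas.update(token.lemma_ for token in doc)

with open("corpus.txt", 'w') as f:
    f.write("Number of words:" + str(len(lemmas)) + "\n")
    f.write(''.join(w + "\n" for w in sorted(lemmas)))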