
ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. spacy

I'm trying to create a corpus of words by a text. I use spacy. So there is my code:

import spacy
nlp = spacy.load('fr_core_news_md')
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()

But it returns this exception:

ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

I tried something like this:

import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1027203
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()

But got the same error:

ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. [...]

How to fix it?

asked Jul 27 '19

3 Answers

I differ from the other answer: I think `nlp.max_length` did execute correctly, but the value you set is too low. It looks like you set it to exactly the length reported in the error message. Increase `nlp.max_length` to a little over that number:

nlp.max_length = 1030000 # or even higher

It should ideally work after this.

So your code could be changed to this:

import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1030000 # or higher
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()
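A small refinement on this answer (my sketch, not part of the original): since the limit is a character count, you can derive it from the input itself instead of hard-coding a number. `required_max_length` below is a hypothetical helper name, not a spaCy API; the spaCy usage is shown as a comment so the sketch stands alone.

```python
# Sketch: compute a safe nlp.max_length from the input size plus a margin.
# required_max_length is a hypothetical helper, not part of spaCy.
def required_max_length(text, margin=100):
    """Return a character limit safely above len(text)."""
    return len(text) + margin

# Usage with spaCy (assumes the model is installed):
#   nlp.max_length = required_max_length(text)
#   doc = nlp(text)
text = "x" * 1027203
print(required_max_length(text))  # 1027303
```

This way the script keeps working when `text.txt` grows, at the cost of the extra memory the parser and NER need for the longer input.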
Rahul P answered Oct 10 '22


I faced the same issue: I had to loop over a directory of text files and perform NER on each one to extract the entities present in them.

for file in folder_text_files:
    with open(file, 'r', errors="ignore") as f:
        text = f.read()
    nlp.max_length = len(text) + 100
    doc = nlp(text)

Setting `nlp.max_length` from each text before processing it means you never have to worry about the text size.
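If raising the limit is not an option (the parser and NER really do need a lot of memory for long inputs), another workaround is to split the text into pieces that each stay under the limit. This is a sketch with a hypothetical helper, `chunk_text`, which is not part of spaCy; the spaCy call is shown as a comment.

```python
# Sketch: split a long text into chunks that each fit under max_length.
# chunk_text is a hypothetical helper, not a spaCy API.
def chunk_text(text, limit=1_000_000):
    """Yield consecutive slices of text, each at most `limit` characters."""
    for start in range(0, len(text), limit):
        yield text[start:start + limit]

# Each chunk can then be passed to nlp() without tripping E088:
#   for chunk in chunk_text(long_text):
#       doc = nlp(chunk)
chunks = list(chunk_text("a" * 2_500_000))
print([len(c) for c in chunks])  # [1000000, 1000000, 500000]
```

Note that slicing at fixed offsets can cut a word in half at a chunk boundary; for real use you would want to back up to the nearest whitespace before splitting.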

Karthick Durai answered Oct 10 '22


It looks like the `nlp.max_length = 1027203` line in your second example did not take effect.

Alternatively, if your text file has multiple lines, you can create your doc for each line in the file. Something like the following:

with open("text.txt") as f:
    for line in f.read().split('\n'):
        doc = nlp(''.join(ch for ch in line if ch.isalnum() or ch == " "))
        ...
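The per-line approach combines naturally with the asker's character filter and a set for deduplication. Here is a sketch that runs without spaCy installed: `clean` and `unique_tokens` are hypothetical helper names, and the lemmatization step is stubbed out with `.split()` where `nlp()` would run.

```python
# Sketch: process text line by line so no single nlp() call exceeds
# max_length, collecting unique tokens in a set.
def clean(line):
    """Keep only alphanumeric characters and spaces (the asker's filter)."""
    return ''.join(ch for ch in line if ch.isalnum() or ch == " ")

def unique_tokens(lines):
    """Collect unique tokens across lines. With spaCy this would be:
       seen.update(tok.lemma_ for tok in nlp(clean(line)))"""
    seen = set()
    for line in lines:
        seen.update(clean(line).split())  # stand-in for spaCy lemmas
    return sorted(seen)

print(unique_tokens(["le chat, le chat!", "le chien."]))
# ['chat', 'chien', 'le']
```

Using a set here also avoids the quadratic `if token.lemma_ not in words` list scan from the original script, which matters once the corpus has tens of thousands of distinct lemmas.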
Ali Cirik answered Oct 10 '22