I'm trying to create a corpus of words from a text. I use spaCy. Here is my code:
import spacy
nlp = spacy.load('fr_core_news_md')
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)
f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()
But it returns this exception:
ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
I tried something like this:
import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1027203
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)
f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()
But got the same error:
ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
How to fix it?
I differ from the answer above: I think nlp.max_length did execute correctly, but the value you set is too low. It looks like you set it to exactly the length reported in the error message. Increase nlp.max_length to a little over that number:
nlp.max_length = 1030000 # or even higher
It should ideally work after this.
So your code could be changed to this:
import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1030000 # or higher
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)
f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()
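The error message itself also hints that the limit mainly protects the parser and NER. Since you only need lemmas, a minimal sketch of that route (assuming the lemmatizer in fr_core_news_md does not need the disabled components; the set-based deduplication is my own addition) would be:

import spacy

# Disable the memory-heavy pipes the error message warns about;
# only the lemmas are needed here.
nlp = spacy.load('fr_core_news_md', disable=['parser', 'ner'])
nlp.max_length = 1030000  # or higher

with open("text.txt") as f:
    doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))

# A set avoids the quadratic `not in words` membership checks on a list.
words = sorted({token.lemma_ for token in doc})

with open("corpus.txt", 'w') as f:
    f.write("Number of words:" + str(len(words)) + "\n" + ''.join(w + "\n" for w in words))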
I faced the same issue. I had to loop over a directory of text files and perform NER on them to extract the entities they contain.
for file in folder_text_files:
    with open(file, 'r', errors="ignore") as f:
        text = f.read()
    # raise the limit before calling nlp() on this file's text
    nlp.max_length = len(text) + 100
So doing this might save you from worrying about the text size.
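For completeness, a hedged sketch of the full loop. folder_text_files, the "texts/*.txt" path, and the entity handling at the end are placeholders I made up; adjust them to your setup:

import glob
import spacy

nlp = spacy.load('fr_core_news_md')
folder_text_files = glob.glob("texts/*.txt")  # hypothetical path

for file in folder_text_files:
    with open(file, 'r', errors="ignore") as f:
        text = f.read()
    nlp.max_length = len(text) + 100  # keep the limit above the current text length
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    print(file, entities)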
It looks like the nlp.max_length = 1027203 line in your second example did not execute correctly.
Alternatively, if your text file has multiple lines, you can create a doc for each line of the file. Something like the following:
for line in f.read().split('\n'):
    doc = nlp(''.join(ch for ch in line if ch.isalnum() or ch == " "))
    ...
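A minimal sketch of that per-line approach applied to the original lemma-collection task (file names are taken from the question; accumulating into a set is my own choice):

import spacy

nlp = spacy.load('fr_core_news_md')
lemmas = set()

with open("text.txt") as f:
    for line in f:  # one line at a time, so no single text exceeds max_length
        doc = nlp(''.join(ch for ch in line if ch.isalnum() or ch == " "))
        lemmas.update(token.lemma_ for token in doc)

with open("corpus.txt", 'w') as f:
    f.write("Number of words:" + str(len(lemmas)) + "\n")
    f.write(''.join(w + "\n" for w in sorted(lemmas)))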