All the examples that I see for using spacy just read in a single, small text file. How does one load a corpus of text files into spacy?
I can do this with textacy by pickling all the text in the corpus:
import textacy

docs = textacy.io.spacy.read_spacy_docs('E:/spacy/DICKENS/dick.pkl', lang='en')
for doc in docs:
    print(doc)
But I am not clear as to how to use this generator object (docs) for further analysis.
Also, I would rather use spacy, not textacy.
spacy also fails to read in a single file that is large (~2,000,000 characters).
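A side note on the large-file failure: by default spaCy refuses texts longer than nlp.max_length, which is 1,000,000 characters, so a ~2,000,000-character file will raise an error. If there is enough memory, the limit can simply be raised. A minimal sketch, with the model name and file path as placeholders:

import spacy

nlp = spacy.load("en_core_web_sm")
# The default limit is 1,000,000 characters; raise it if RAM allows
nlp.max_length = 3_000_000

with open("big_file.txt", encoding="utf-8") as f:
    doc = nlp(f.read())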
Any help is appreciated...
Ravi
So I finally got this working, and it shall be preserved here for posterity.
Start with a generator, here named path_iterator because I'm currently too afraid to change anything for fear of it breaking again:
def path_iterator(paths):
    for p in paths:
        print("yielding")
        # read lazily, one file per yield (here only the first 25 characters)
        yield p.open("r").read(25)
Get an iterator, generator, or list of paths:
from pathlib import Path

my_paths = Path("/data/train").glob("*.txt")
This gets wrapped in our path_iterator function from above and passed to nlp.pipe. In goes a generator, out comes a generator. The batch_size=5 is required here, or it will fall back into the bad habit of first reading all the files:

docs = nlp.pipe(path_iterator(my_paths), batch_size=5)
The important part, and the reason why we're doing all this, is that until now nothing has happened. We're not waiting for a thousand files to be processed or anything. That happens only on demand, when you start reading from docs:
for d in docs:
    print("A document!")
You will see alternating blocks of five (our batch_size from above) "yielding" and "A document!" messages. It's an actual pipeline now, and data starts coming out very soon after you start it.
And while I'm currently running a version one minor tick too old for this, the coup de grâce is multiprocessing:
# For those with these new AMD CPUs with hundreds of cores
docs = nlp.pipe(path_iterator(my_paths), batch_size=5, n_process=64)
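Putting the pieces together, this is roughly what the whole thing looks like in one place. A sketch only: the en_core_web_sm model and the utf-8 encoding are assumptions, and it reads whole files rather than the first 25 characters:

from pathlib import Path

import spacy

def path_iterator(paths):
    # Lazily open and read each file only when the pipeline asks for it
    for p in paths:
        print("yielding")
        yield p.open("r", encoding="utf-8").read()

nlp = spacy.load("en_core_web_sm")
my_paths = Path("/data/train").glob("*.txt")

# Generator in, generator out; nothing is read until docs is consumed
docs = nlp.pipe(path_iterator(my_paths), batch_size=5)

for d in docs:
    print("A document!", len(d))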
You would just read in the files one at a time. This is what I usually do with my corpus files:
import glob
import spacy

nlp = spacy.load("en_core_web_sm")
path = 'your path here\\*.txt'

for file in glob.glob(path):
    with open(file, encoding='utf-8', errors='ignore') as file_in:
        text = file_in.read()
        lines = text.split('\n')
        for line in lines:
            doc = nlp(line)
            for token in doc:
                print(token)
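If calling nlp() on every single line turns out to be slow on a large corpus, the same idea works with nlp.pipe, which batches the lines through the pipeline much like the answer above. A sketch along those lines (the batch_size value is arbitrary, and the path placeholder is kept as-is):

import glob

import spacy

nlp = spacy.load("en_core_web_sm")
path = 'your path here\\*.txt'

def iter_lines(pattern):
    # Yield one line at a time across every matching file
    for file in glob.glob(pattern):
        with open(file, encoding='utf-8', errors='ignore') as file_in:
            for line in file_in:
                yield line.strip()

# Batch the lines instead of running the full pipeline once per nlp() call
for doc in nlp.pipe(iter_lines(path), batch_size=50):
    for token in doc:
        print(token)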