
Read a corpus of text files in spaCy

All the examples I see for using spaCy read in just a single, small text file. How does one load a corpus of text files into spaCy?

I can do this with textacy by pickling all the text in the corpus:

docs = textacy.io.spacy.read_spacy_docs('E:/spacy/DICKENS/dick.pkl', lang='en')

for doc in docs:
    print(doc)

But I am not clear on how to use this generator object (docs) for further analysis.

Also, I would rather use spacy, not textacy.

spaCy also fails to read a single large file (~2,000,000 characters).

Any help is appreciated...

Ravi



2 Answers

So I finally got this working, and it shall be preserved here for posterity.

Start with a generator, here named path_iterator; I'm currently too afraid to change anything for fear of it breaking again:

def path_iterator(paths):
    for p in paths:
        print("yielding")
        yield p.open("r").read(25)  # read(25) keeps this demo fast; use .read() for the full text

Get an iterator, generator, or list of paths:

from pathlib import Path

my_files = Path("/data/train").glob("*.txt")

This gets wrapped in our path_iterator function from above and passed to nlp.pipe. In goes a generator, out comes a generator. The batch_size=5 is required here, or it will fall back into the bad habit of reading all the files up front:

docs = nlp.pipe(path_iterator(my_files), batch_size=5)

The important part, and the reason we're doing all this, is that so far nothing has happened. We're not waiting for a thousand files to be processed or anything. Processing happens only on demand, when you start reading from docs:

for d in docs:
    print("A document!")

You will see alternating blocks of five (our batch_size from above) "yielding"s and "A document!"s. It's an actual pipeline now, and data starts coming out very soon after starting it.

And while I'm currently running a version one minor tick too old for this, the coup de grâce is multiprocessing:

# For those with these new AMD CPUs with hundreds of cores
docs = nlp.pipe(path_iterator(my_files), batch_size=5, n_process=64)
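Putting all the pieces together, here is a minimal end-to-end sketch (the en_core_web_sm model and the /data/train directory are assumptions; nlp.max_length is raised because spaCy's default limit of 1,000,000 characters would reject the ~2,000,000-character file from the question):

from pathlib import Path
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 3_000_000  # default is 1,000,000 characters; raise it for large files

def path_iterator(paths):
    for p in paths:
        # Lazily yield one file's full text at a time
        yield p.open("r", encoding="utf-8").read()

my_files = Path("/data/train").glob("*.txt")  # assumed corpus location

docs = nlp.pipe(path_iterator(my_files), batch_size=5)

for doc in docs:  # processing happens on demand, batch by batch
    print(len(doc))  # e.g. token count per document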

Matthias Winkelmann


You would just read in the files one at a time. This is what I usually do with my corpus files:

import glob
import spacy

nlp = spacy.load("en_core_web_sm")
path = 'your path here\\*.txt'

for file in glob.glob(path):
    with open(file, encoding='utf-8', errors='ignore') as file_in:
        text = file_in.read()
        lines = text.split('\n')
        for line in lines:
            doc = nlp(line)  # process one line at a time
            for token in doc:
                print(token)
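If calling nlp on every line feels slow for a big corpus, the same loop can be batched through nlp.pipe (a sketch using spaCy's standard batching API; the placeholder path is the same as above):

import glob
import spacy

nlp = spacy.load("en_core_web_sm")
path = 'your path here\\*.txt'

def lines_from(pattern):
    # Lazily yield non-empty lines from every matching file
    for file in glob.glob(pattern):
        with open(file, encoding='utf-8', errors='ignore') as file_in:
            for line in file_in:
                line = line.strip()
                if line:
                    yield line

# nlp.pipe batches lines through the pipeline, much faster than nlp(line) per call
for doc in nlp.pipe(lines_from(path), batch_size=100):
    for token in doc:
        print(token)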

Nester