I am using spaCy for NLP in Python. I am trying to use nlp.pipe() to generate a list of spaCy Doc objects, which I can then analyze. Oddly enough, nlp.pipe() returns a generator object (<generator object pipe at 0x7f28640fefa0>) rather than a list. How can I get it to return a list of docs, as intended?
import spacy

# note: the module name is lowercase 'spacy'; the medium English model is
# 'en_core_web_md', and the pipeline components are named 'tagger' and
# 'parser' (there is no component called 'tagging')
nlp = spacy.load('en_core_web_md', disable=['tagger', 'parser'])
matches = ['one', 'two', 'three']
docs = nlp.pipe(matches)
docs
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
Fundamentally, a spaCy pipeline package consists of three components: the weights, i.e. binary data loaded in from a directory, a pipeline of functions called in order, and language data like the tokenization rules and language-specific settings.
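For instance, here is a minimal sketch of inspecting a loaded pipeline (assuming the en_core_web_md model is installed; the exact component names vary by model and spaCy version):

import spacy

nlp = spacy.load('en_core_web_md')
# Ordered names of the pipeline components, e.g. ['tok2vec', 'tagger', ...]
print(nlp.pipe_names)
# The tokenizer always runs first and is not listed in pipe_names
doc = nlp("The text is tokenized first, then passed through each component.")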
In the scheme used by spaCy, prepositions are referred to as "adpositions" and use the tag ADP. Words like "Friday" or "Obama" are tagged PROPN, which stands for "proper noun", a tag reserved for names of known individuals, places, time references, organizations, events and the like.
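A quick illustration of those tags (again assuming an installed English model; this is a sketch, not an official documentation example):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Obama flew to Chicago on Friday.")
for token in doc:
    print(token.text, token.pos_)
# 'to' and 'on' are tagged ADP; 'Obama', 'Chicago' and 'Friday' are PROPN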
While NLTK provides access to many alternative algorithms for each task, spaCy focuses on offering one well-tuned implementation per task. It provides fast, accurate syntactic analysis and access to large word vectors that are easy to customize.
For iterating through the docs, just do

for doc in docs:
    print(doc.text)

or materialize them all at once with

list_of_docs = list(docs)
nlp.pipe returns a generator on purpose! Generators are awesome. They are more memory-friendly than lists: they let you iterate over a series of objects, but they only produce the next object when it is actually needed, rather than building everything at once.
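You can see that laziness directly; nothing is computed until you ask for an item. A sketch, reusing nlp and matches from the question:

docs = nlp.pipe(matches)
first = next(docs)   # only now is the first text actually processed
print(first.text)    # 'one'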
spaCy is going to turn those strings into Doc objects, which are honkin' big C structs. If your corpus is big enough, storing it all in one variable (e.g., docs = [nlp(text) for text in matches] or docs = list(nlp.pipe(matches))) will be inefficient or even impossible. If you're training on any significant amount of data, this won't be a great idea.
Even if it isn't literally impossible, you can do cool things faster if you use the generator as part of a pipeline instead of just dumping it into a list. If you only want to extract certain information, for example to build a database column of just the named entities or just the place names in your data, you don't need to store the whole corpus in a list and then run a nested for-loop to pull them out (see the sketch below).
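Here is a sketch of that streaming pattern; texts stands in for your own iterable of strings, and the pipeline is assumed to include an entity recognizer:

entities = []
for doc in nlp.pipe(texts):
    # keep only the entity strings; each Doc can then be garbage-collected
    entities.extend(ent.text for ent in doc.ents)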
Moreover, attributes like Doc.sents and Doc.noun_chunks (and many others) are themselves generators. Similar kinds of data types show up in gensim as well -- half the challenge of NLP is figuring out how to do this stuff in ways that scale, so it's worth getting used to more efficient containers. (Plus, you can do cooler things with them!)
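For example (assuming the parser is enabled, since sentence boundaries come from it):

doc = nlp("This is one sentence. This is another.")
sents = doc.sents          # a generator of Span objects, not a list
print(next(sents).text)    # 'This is one sentence.'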
The official spaCy course (course.spacy.io) has some notes on scaling and performance in Chapter 3.