
Spacy - nlp.pipe() returns generator

Tags:

python

nlp

spacy

I am using spaCy for NLP in Python. I am trying to use nlp.pipe() to generate a list of spaCy Doc objects, which I can then analyze. Oddly enough, nlp.pipe() returns an object of the class <generator object pipe at 0x7f28640fefa0>. How can I get it to return a list of docs, as intended?

import spacy
nlp = spacy.load('en_core_web_md', disable=['tagger', 'parser'])
matches = ['one', 'two', 'three']
docs = nlp.pipe(matches)
docs
asked Jul 16 '18 by Chris C

People also ask

What does NLP () do in spaCy?

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by a trained model typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
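A minimal sketch of this, assuming spaCy is installed (spacy.blank("en") gives a pipeline with only the tokenizer, no trained components, so no model download is needed):

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

# Calling nlp on a text produces a Doc object of tokens
doc = nlp("Hello, world!")
tokens = [token.text for token in doc]
# Tokenization splits off punctuation: ['Hello', ',', 'world', '!']
```

With a trained model loaded via spacy.load, the same call would also run the tagger, parser and entity recognizer over the Doc.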

What is a spaCy pipeline?

Fundamentally, a spaCy pipeline package consists of three components: the weights, i.e. binary data loaded in from a directory, a pipeline of functions called in order, and language data like the tokenization rules and language-specific settings.

What is Propn in spaCy?

In the scheme used by spaCy, prepositions are referred to as "adpositions" and use the tag ADP. Words like "Friday" or "Obama" are tagged with PROPN, which stands for "proper noun", reserved for names of known individuals, places, time references, organizations, events and such.

Is spaCy better than NLTK?

While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.


2 Answers

To iterate through the docs, just do

for doc in docs:
    ...

or materialize the generator into a list:

list_of_docs = list(docs)
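A runnable sketch of both options (using spacy.blank("en") so no model download is needed; substitute your own loaded model):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline; stands in for a loaded model
matches = ['one', 'two', 'three']

# Option 1: iterate lazily over the generator, one Doc at a time
for doc in nlp.pipe(matches):
    print(doc.text)

# Option 2: materialize everything into a list of Doc objects
docs = list(nlp.pipe(matches))
```

After the list() call, docs supports indexing, len(), and repeated iteration, unlike the one-shot generator.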
answered Sep 28 '22 by Bayko


nlp.pipe returns a generator on purpose! Generators are awesome. They are more memory-friendly in that they let you iterate over a series of objects, but unlike a list, they only evaluate the next object when they need to, rather than all at once.
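To see this laziness without any spaCy machinery, here is a plain-Python stand-in:

```python
def squares(n):
    for i in range(n):
        yield i * i   # computed only when requested

gen = squares(3)      # nothing has been computed yet
first = next(gen)     # computes exactly one value: 0
rest = list(gen)      # consumes what remains: [1, 4]
```

nlp.pipe behaves the same way: each Doc is only built when the loop (or list()) asks for it.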

spaCy is going to turn those strings into Doc objects, which are backed by sizable C structs. If your corpus is big enough, storing it all in one variable (e.g., docs = list(nlp.pipe(matches))) will be inefficient or even impossible. If you're training on any significant amount of data, this won't be a great idea.

Even when it isn't literally impossible, you can do cool things faster if you use the generator as part of a pipeline instead of just dumping it into a list. If you want to extract only certain information, for example to build a database column of just the named entities or just the place names in your data, you don't need to store the whole corpus in a list and then run a nested for-loop to get them out.
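A streaming entity extractor along those lines might look like this (a sketch; spacy.blank("en") has no NER component, so with it doc.ents is always empty, and you would load a model such as en_core_web_sm for real entities):

```python
import spacy

nlp = spacy.blank("en")  # swap in spacy.load("en_core_web_sm") for real NER

def entity_texts(texts):
    # Stream Doc objects one at a time; the full corpus never sits in memory
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            yield ent.text

# Consume lazily, e.g. writing straight to a database column
for name in entity_texts(["first document", "second document"]):
    print(name)
```

Because entity_texts is itself a generator, you can chain it into further processing without ever materializing the intermediate results.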

Moreover, the Doc.sents attribute (and several others) is a generator too. Similar kinds of data types show up in gensim as well -- half the challenge of NLP is figuring out how to do this stuff in ways that will scale, so it's worth getting used to more efficient containers. (Plus, you can do cooler things with them!)

The official spaCy course has some notes on scaling and performance in Chapter 3.

answered Sep 28 '22 by Ray Johns