
Spacy - nlp.pipe() returns generator

Tags:

python

nlp

spacy

I am using spaCy for NLP in Python. I am trying to use nlp.pipe() to generate a list of spaCy Doc objects, which I can then analyze. Oddly enough, nlp.pipe() returns an object of the class <generator object pipe at 0x7f28640fefa0>. How can I get it to return a list of docs, as intended?

import spacy
nlp = spacy.load('en_core_web_md', disable=['tagger', 'parser'])
matches = ['one', 'two', 'three']
docs = nlp.pipe(matches)
docs
asked Jul 16 '18 by Chris C

People also ask

What does NLP () do in spaCy?

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by a trained model typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
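A minimal sketch of this, assuming spaCy is installed (spacy.blank("en") gives a pipeline with only the tokenizer, no trained components, so no model download is needed):

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

# Calling nlp on a text produces a Doc object of tokens
doc = nlp("Hello, world!")
tokens = [token.text for token in doc]
# Tokenization splits off punctuation: ['Hello', ',', 'world', '!']
```

With a trained model loaded via spacy.load, the same call would also run the tagger, parser and entity recognizer over the Doc.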

What is a spaCy pipeline?

Fundamentally, a spaCy pipeline package consists of three components: the weights, i.e. binary data loaded in from a directory, a pipeline of functions called in order, and language data like the tokenization rules and language-specific settings.

What is Propn in spaCy?

In the scheme used by spaCy, prepositions are referred to as "adpositions" and use the tag ADP. Words like "Friday" or "Obama" are tagged with PROPN, which stands for "proper noun", reserved for names of known individuals, places, time references, organizations, events and such.

Is spaCy better than NLTK?

While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.


2 Answers

To iterate through the docs, just do

for doc in docs:
    ...

or materialize the generator into a list:

list_of_docs = list(docs)
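A runnable sketch of both options (using spacy.blank("en") so no model download is needed; substitute your own loaded model):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline; stands in for a loaded model
matches = ['one', 'two', 'three']

# Option 1: iterate lazily over the generator, one Doc at a time
for doc in nlp.pipe(matches):
    print(doc.text)

# Option 2: materialize everything into a list of Doc objects
docs = list(nlp.pipe(matches))
```

After the list() call, docs supports indexing, len(), and repeated iteration, unlike the one-shot generator.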
answered Sep 28 '22 by Bayko


nlp.pipe returns a generator on purpose! Generators are awesome. They are more memory-friendly in that they let you iterate over a series of objects, but unlike a list, they only evaluate the next object when they need to, rather than all at once.
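To see this laziness without any spaCy machinery, here is a plain-Python stand-in:

```python
def squares(n):
    for i in range(n):
        yield i * i   # computed only when requested

gen = squares(3)      # nothing has been computed yet
first = next(gen)     # computes exactly one value: 0
rest = list(gen)      # consumes what remains: [1, 4]
```

nlp.pipe behaves the same way: each Doc is only built when the loop (or list()) asks for it.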

spaCy is going to turn those strings into Doc objects, which are backed by sizable C structs. If your corpus is big enough, storing it all in one variable (e.g., docs = list(nlp.pipe(matches))) will be inefficient or even impossible. If you're training on any significant amount of data, this won't be a great idea.

Even when it isn't literally impossible, you can do cool things faster if you use the generator as part of a pipeline instead of just dumping it into a list. If you want to extract only certain information, for example to build a database column of just the named entities or just the place names in your data, you don't need to store the whole corpus in a list and then run a nested for-loop to get them out.
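A streaming entity extractor along those lines might look like this (a sketch; spacy.blank("en") has no NER component, so with it doc.ents is always empty, and you would load a model such as en_core_web_sm for real entities):

```python
import spacy

nlp = spacy.blank("en")  # swap in spacy.load("en_core_web_sm") for real NER

def entity_texts(texts):
    # Stream Doc objects one at a time; the full corpus never sits in memory
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            yield ent.text

# Consume lazily, e.g. writing straight to a database column
for name in entity_texts(["first document", "second document"]):
    print(name)
```

Because entity_texts is itself a generator, you can chain it into further processing without ever materializing the intermediate results.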

Moreover, the Doc.sents attribute (and several others) is a generator too. Similar kinds of data types show up in gensim as well -- half the challenge of NLP is figuring out how to do this stuff in ways that will scale, so it's worth getting used to more efficient containers. (Plus, you can do cooler things with them!)

The official spaCy course has some notes on scaling and performance in Chapter 3.

answered Sep 28 '22 by Ray Johns