Applying Spacy Parser to Pandas DataFrame w/ Multiprocessing

Say I have a dataset, like

iris = pd.DataFrame(sns.load_dataset('iris'))

I can use spaCy with .apply to parse a string column into tokens (my real dataset has more than one word/token per entry, of course):

import spacy  # (I have version 1.8.2)
nlp = spacy.load('en')
iris['species_parsed'] = iris['species'].apply(nlp)

result:

   sepal_length   ... species    species_parsed
0           1.4   ... setosa          (setosa)
1           1.4   ... setosa          (setosa)
2           1.3   ... setosa          (setosa)

I can also use this convenient multiprocessing function (thanks to this blogpost) to do most arbitrary apply functions on a dataframe in parallel:

from multiprocessing import Pool, cpu_count

def parallelize_dataframe(df, func, num_partitions):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_partitions)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

for example:

def my_func(df):
    df['length_of_word'] = df['species'].apply(lambda x: len(x))
    return df

num_cores = cpu_count()
iris = parallelize_dataframe(iris, my_func, num_cores)

result:

   sepal_length species  length_of_word
0           5.1  setosa               6
1           4.9  setosa               6
2           4.7  setosa               6

...But for some reason, I can't apply the Spacy parser to a dataframe using multiprocessing this way.

def add_parsed(df):
    df['species_parsed'] = df['species'].apply(nlp)
    return df

iris = parallelize_dataframe(iris, add_parsed, num_cores)

result:

   sepal_length species  length_of_word species_parsed
0           5.1  setosa               6             ()
1           4.9  setosa               6             ()
2           4.7  setosa               6             ()

Is there some other way to do this? I'm loving spaCy for NLP, but I have a lot of text data, so I'd like to parallelize some processing functions, and I ran into this issue.

asked Jun 06 '17 16:06

Max Power

1 Answer

spaCy is highly optimised and handles the parallelism for you through nlp.pipe. As a result, I think your best bet is to take the data out of the DataFrame and pass it to the spaCy pipeline as a list, rather than trying to use .apply directly.

You then need to collate the results of the parse and put them back into the DataFrame.

So, in your example, you could use something like:

tokens = []
lemma = []
pos = []

for doc in nlp.pipe(df['species'].astype('unicode').values, batch_size=50,
                    n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # We want to make sure that the lists of parsed results have the
        # same number of entries of the original Dataframe, so add some
        # blanks in case the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

df['species_tokens'] = tokens
df['species_lemma'] = lemma
df['species_pos'] = pos

This approach works fine on small datasets, but it holds every parse result in memory at once, so it is not great if you want to process huge amounts of text.
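If memory is the concern, one option is to stream the column through the pipeline in chunks and keep only plain token strings, so the caller can write each result out instead of collecting everything in lists. What follows is only a rough sketch under stated assumptions: iter_chunks, parse_in_chunks and whitespace_pipe are hypothetical names, not part of spaCy, and whitespace_pipe is a stand-in for nlp.pipe so the example runs without spaCy installed.

```python
from itertools import islice


def iter_chunks(texts, chunk_size):
    """Yield successive lists of at most chunk_size items from texts."""
    iterator = iter(texts)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def parse_in_chunks(texts, pipe, chunk_size=1000):
    """Lazily yield one list of token strings per input text.

    `pipe` is any callable mapping an iterable of strings to an iterable
    of token sequences (assumption: with spaCy you would pass nlp.pipe).
    """
    for chunk in iter_chunks(texts, chunk_size):
        for doc in pipe(chunk):
            # Keep plain strings only; objects with a .text attribute
            # (such as spaCy tokens) are reduced to that text.
            yield [getattr(token, "text", token) for token in doc]


def whitespace_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: splits each text on whitespace."""
    for text in texts:
        yield text.split()


rows = ["Iris setosa", "Iris versicolor", "Iris virginica"]
parsed = list(parse_in_chunks(rows, whitespace_pipe, chunk_size=2))
print(parsed)
```

Because parse_in_chunks is a generator, you could loop over it and append each row to a file rather than calling list() on it as the demo does; only one chunk of input is held at a time.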

answered Sep 22 '22 17:09

Ed Rushton

Tags: python, multiprocessing, nlp, spacy