I have a large pandas DataFrame of survey string responses, and we would like to trial some of spaCy's NLP features. We are just exploring the capabilities at the moment, but are struggling with how to get the data into a format that works with spaCy's nlp function.
Eventually we would like to be able to look at popular topics in the string responses against their user data.
How do I run the nlp pipeline on a column of a dataframe? Or am I going about this the wrong way?
You begin by calling spacy.load() with a language model. Depending on which model you choose, this loads the tokenizer, tagger, parser, NER and word vectors for the language of your choice. In the spaCy documentation the result is conventionally stored in a variable called nlp:
nlp = spacy.load(language_model)
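As a minimal sketch, assuming the small English model en_core_web_sm has been downloaded (python -m spacy download en_core_web_sm):

import spacy

# Load the small English pipeline (tokenizer, tagger, parser, NER, ...)
nlp = spacy.load("en_core_web_sm")

doc = nlp("The survey responses were overwhelmingly positive.")
print([(token.text, token.pos_) for token in doc])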
We can now call nlp() on any text string. So why doesn't nlp(df['column_with_strings']) work? Because df['column_with_strings'] is not a string, it is a pandas.Series:
TypeError: Argument 'string' has incorrect type (expected str, got Series)
So what you need to do is call nlp() on each value in the pandas.Series. You can do this by passing nlp to df['column_with_strings'].apply(), or by iterating over each row in the Series.
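A minimal sketch of the apply() approach, assuming a DataFrame df with a string column named column_with_strings (both names are placeholders):

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

df = pd.DataFrame({"column_with_strings": ["I love the new layout.", "Shipping was slow."]})

# Run the pipeline on every value; each cell becomes a spaCy Doc object
df["doc"] = df["column_with_strings"].apply(nlp)

print(df["doc"].iloc[0].ents)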
There is a faster, more efficient way to run a Series of texts through the nlp pipeline: spaCy recommends using nlp.pipe() when processing large volumes of text.
Following the instructions given in the documentation, you can do the following:
texts = dataframe['series_with_text']
(Make sure you have converted the values to strings and removed any NaN values that might exist in your data frame.)
Then:
docs = list(nlp.pipe(texts))
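Putting it together, a sketch assuming a hypothetical dataframe with a column named series_with_text:

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

dataframe = pd.DataFrame({"series_with_text": ["Great customer service!", None, 42]})

# Drop missing values and coerce everything to str before piping
texts = dataframe["series_with_text"].dropna().astype(str)

# nlp.pipe() streams the texts through the pipeline in batches
docs = list(nlp.pipe(texts))

# e.g. pull out noun chunks as a rough proxy for topics
for doc in docs:
    print([chunk.text for chunk in doc.noun_chunks])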