 

Spacy NLP with data from a Pandas DataFrame

I have a large pandas DataFrame of free-text survey responses, and we would like to trial some features of spaCy's NLP. We are just exploring the capabilities at the moment, but we are struggling to get the data into a form that the nlp function of spaCy accepts.

Eventually we would like to be able to look at popular topics in the string responses against their user data.

How do I run the nlp pipeline on a column of a DataFrame? Or am I going about this the wrong way?

Ben Peddie asked Sep 03 '25 03:09


2 Answers

You begin by calling spacy.load() with a language model. Depending on which model you choose, this loads a tokenizer, tagger, parser, named-entity recognizer, and word vectors for the language of your choice. The spaCy documentation conventionally stores the result in a variable called nlp.

import spacy
nlp = spacy.load(language_model)

We can now call nlp() on any text string. So why doesn't nlp(df['column_with_strings']) work? Because df['column_with_strings'] is not a string, it is a pandas.Series:

TypeError: Argument 'string' has incorrect type (expected str, got Series)

So what you need to do is call nlp() on each value in the pandas.Series. You can do this by passing a function to df['column_with_strings'].apply() or by iterating over the values in the series.
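As a rough illustration of the apply() route, here is a minimal sketch. The DataFrame contents are made up for the example, and spacy.blank("en") is used only so that it runs without downloading a model; it provides a tokenizer and nothing else, so you would replace it with spacy.load() and an installed model to get tagging, parsing, or NER.

```python
import pandas as pd
import spacy

# Assumption: a tokenizer-only English pipeline, so no model download is needed.
# Replace with spacy.load(<installed model>) for tagging, parsing, or NER.
nlp = spacy.blank("en")

# Hypothetical survey responses standing in for the real data.
df = pd.DataFrame(
    {"column_with_strings": ["The app is great.", "Support was slow to reply."]}
)

# nlp is callable, so it can be passed to apply() directly.
# Each cell of the new column then holds one spaCy Doc.
df["doc"] = df["column_with_strings"].apply(nlp)

# Example of reading something back out of the Docs: tokens per response.
token_counts = df["doc"].apply(len).tolist()
```

This calls nlp() once per row, which is fine for exploring a sample but can be slow on a large DataFrame.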

user3471881 answered Sep 05 '25 20:09


There is a faster way to run a Series of texts through spaCy's nlp pipeline. The spaCy documentation recommends nlp.pipe() for processing large volumes of text, because it handles the texts as a stream in batches instead of one at a time.

Following the instructions that are given in the documentation you can do the following:

texts = dataframe['series_with_text']

(Make sure the values have been converted to strings and that any NaN values in the column have been removed.)

Then:

docs = list(nlp.pipe(texts))
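Putting those steps together, a sketch along these lines might be a starting point. The sample data, the n_tokens column, and the batch_size value are illustrative assumptions, and spacy.blank("en") again stands in for a loaded model so that the example runs without a download.

```python
import pandas as pd
import spacy

# Assumption: tokenizer-only pipeline as a stand-in for spacy.load(<model>).
nlp = spacy.blank("en")

# Hypothetical data with one missing response.
dataframe = pd.DataFrame(
    {"series_with_text": ["Loved the new layout.", None, "Checkout kept failing."]}
)

# Drop missing values first, then cast to str.
# dropna() keeps the original index labels of the surviving rows.
texts = dataframe["series_with_text"].dropna().astype(str)

# nlp.pipe() yields one Doc per text, in input order, processed in batches.
docs = list(nlp.pipe(texts, batch_size=50))

# Use the surviving index to line results back up with the original rows;
# the row with the missing response gets NaN.
token_counts = pd.Series([len(doc) for doc in docs], index=texts.index)
dataframe["n_tokens"] = token_counts
```

Keeping the index from texts is what lets the results be joined back to the user data for each response.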
Paschalis Ag answered Sep 05 '25 21:09