Tokenizing using Pandas and spaCy

I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to organize/build the dataset. I'm hoping to use spaCy for all the NLP but can't quite figure out how to tokenize the text in my columns. I've read a bunch of the spaCy documentation and googled around, but all the examples I've found are for a single sentence or word, not 75K rows in a pandas df.

I've tried things like: df['new_col'] = [token for token in (df['col'])]

but would definitely appreciate some help/resources.

full (albeit messy) code available here

asked Oct 27 '17 18:10

LMGagne

1 Answer

I've never used spaCy (nltk has always gotten the job done for me) but from glancing at the documentation it looks like this should work:

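import spacy

# 'en' is the shortcut model name from spaCy 1.x/2.x; spaCy 3+ needs the
# full package name instead, e.g. spacy.load('en_core_web_sm').
nlp = spacy.load('en')

# Each cell of new_col becomes a spaCy Doc (a sequence of tokens).
df['new_col'] = df['text'].apply(lambda x: nlp(x))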

Note that nlp by default runs the entire spaCy pipeline, which includes part-of-speech tagging, parsing and named entity recognition. You can significantly speed up your code by using nlp.tokenizer(x) instead of nlp(x), or by disabling parts of the pipeline when you load the model, e.g. nlp = spacy.load('en', parser=False, entity=False). Those keyword arguments are the spaCy 1.x form; from spaCy 2 onward the equivalent is spacy.load('en', disable=['parser', 'ner']).
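If all you need is tokens, something along these lines might be a reasonable starting point. It is only a sketch: it assumes a tokenizer-only pipeline from spacy.blank("en") (so no model download), and the toy DataFrame and the "tokens" column name are made up for illustration. It uses nlp.pipe to batch the texts rather than calling nlp once per row.

```python
import pandas as pd
import spacy

# Assumption: a blank English pipeline, which contains only the tokenizer.
nlp = spacy.blank("en")

# Hypothetical toy data standing in for the real 75K-row DataFrame.
df = pd.DataFrame({"text": ["Hello world.", "spaCy splits text into tokens.", None]})

# The tokenizer expects strings, so replace missing cells with empty strings.
texts = df["text"].fillna("").astype(str)

# nlp.pipe streams the texts through the pipeline in batches;
# each Doc is reduced to a plain list of token strings.
df["tokens"] = [
    [token.text for token in doc]
    for doc in nlp.pipe(texts, batch_size=1000)
]

print(df["tokens"].tolist())
```

Storing plain lists of strings instead of Doc objects keeps the column lighter, but you lose the token attributes, so keep the Docs if you need things like lemmas later (which would also require a full model rather than a blank pipeline).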

answered Sep 23 '22 13:09

Peter

Tags: python, python-3.x, pandas, tokenize, spacy
