
Tokenizing using Pandas and spaCy

I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to organize/build the dataset. I'm hoping to use spaCy for all the NLP but can't quite figure out how to tokenize the text in my columns. I've read a bunch of the spaCy documentation and googled around, but all the examples I've found are for a single sentence or word, not 75K rows in a pandas df.

I've tried things like: df['new_col'] = [token for token in (df['col'])]

but would definitely appreciate some help/resources.

full (albeit messy) code available here

LMGagne asked Oct 27 '17 18:10



1 Answer

I've never used spaCy (NLTK has always gotten the job done for me), but from glancing at the documentation it looks like this should work:

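import spacy

# The 'en' shortcut was removed in spaCy 3; load a full package name instead
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Run the pipeline on every row; each cell of new_col holds a spaCy Doc
df['new_col'] = df['text'].apply(lambda x: nlp(x))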
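The column produced this way holds spaCy Doc objects rather than plain strings. As a rough sketch of how the token texts could be pulled out of each Doc, assuming a small made-up DataFrame with a 'text' column and a blank English pipeline (spacy.blank('en'), tokenizer only) so that no trained model has to be downloaded:

```python
import pandas as pd
import spacy

# Assumption: a blank English pipeline (tokenizer only), used here so the
# sketch runs without downloading a trained model.
nlp = spacy.blank("en")

# Hypothetical sample data standing in for the real DataFrame.
df = pd.DataFrame({"text": ["Hello world.", "This is a test."]})

# Each cell of 'doc' is a spaCy Doc object.
df["doc"] = df["text"].apply(nlp)

# Convert each Doc into a plain list of token strings.
df["tokens"] = df["doc"].apply(lambda doc: [token.text for token in doc])

print(df["tokens"].tolist())
```

Keeping the Doc column is handy if later steps need lemmas or tags from a trained pipeline; the plain token lists are easier to pass to other libraries.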

Note that nlp by default runs the entire spaCy pipeline, which includes part-of-speech tagging, parsing and named entity recognition. You can significantly speed up your code by using nlp.tokenizer(x) instead of nlp(x), or by disabling parts of the pipeline when you load the model. E.g. nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner']) in spaCy 2 and later (the older 1.x releases used spacy.load('en', parser=False, entity=False)).
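For tens of thousands of rows, the two speed-ups can be sketched roughly as follows. This assumes the same kind of made-up DataFrame with a 'text' column and a blank English pipeline as a stand-in for a trained one; the column names tokens_fast and tokens_piped and the batch size are illustrative choices, not recommendations from the spaCy docs:

```python
import pandas as pd
import spacy

# Assumption: a blank pipeline stands in for a trained model here.
nlp = spacy.blank("en")

# Hypothetical sample data standing in for the real DataFrame.
df = pd.DataFrame({"text": ["Hello world.", "Pandas and spaCy work together."]})

# Option 1: call only the tokenizer, skipping every other pipeline component.
df["tokens_fast"] = df["text"].apply(
    lambda x: [token.text for token in nlp.tokenizer(x)]
)

# Option 2: stream the texts through nlp.pipe, which processes them in batches
# instead of making one nlp() call per row.
docs = nlp.pipe(df["text"].tolist(), batch_size=1000)
df["tokens_piped"] = [[token.text for token in doc] for doc in docs]

print(df[["tokens_fast", "tokens_piped"]])
```

Both options expect every cell to be a string, so missing values would need to be filled or dropped first.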

Peter answered Sep 23 '22 13:09