I am trying to extract linguistic features from text using spaCy in Python 3. My input looks like this:
Sent_id  Text
1        I am exploring text analytics using spacy
2        amazing spacy is going to help me
I am looking for output like the table below, produced by extracting bigram/trigram phrases that match POS patterns I supply, such as NOUN VERB NOUN or ADJ NOUN, while retaining the dataframe structure. If a sentence yields multiple phrases, its record should be duplicated once per phrase.
Sent_id  Text                                       Feature                   Pattern
1        I am exploring text analytics using spacy  exploring text analytics  VERB NOUN NOUN
1        I am exploring text analytics using spacy  analytics using spacy     NOUN VERB NOUN
2        amazing spacy is going to help me          amazing spacy             ADJ NOUN
spaCy provides a set of POS tags such as NOUN (noun), PUNCT (punctuation), ADJ (adjective), and ADV (adverb). Its trained pipelines use statistical models to predict which tag or label applies to each token in context. For example, a word following "the" in English is most likely a noun.
NLTK gives access to many algorithms for a given task, whereas spaCy generally offers one well-tuned implementation. It is designed for fast, accurate syntactic analysis and also offers access to larger word vectors that are easier to customize.
spaCy is a free, open-source library for NLP in Python.
import spacy
import pandas as pd
import re
# Load spacy model once and reuse
nlp = spacy.load('en_core_web_sm')
# The dataframe with text
df = pd.DataFrame({
'Sent_id': [1,2],
'Text': [ "I am exploring text analytics using spacy", "amazing spacy is going to help me"]
})
# Patterns we are interested in
patterns = ["VERB NOUN", "NOUN VERB NOUN"]
# Convert each pattern into regular expression
re_patterns = [" ".join([r"(\w+)_!" + pos for pos in p.split()]) for p in patterns]
def extract(nlp, text, patterns, re_patterns):
    """Extract the pieces of text matching the POS patterns in patterns.

    args:
        nlp: Loaded spaCy model object
        text: The input text
        patterns: The list of patterns to be searched
        re_patterns: The patterns converted into regex
    returns: A list of [t, p] pairs where
        t is the part of text matching the pattern p in patterns
    """
    doc = nlp(text)
    matches = []
    # Tag every token as word_!POS so the regexes can match on POS tags
    text_pos = " ".join([token.text + "_!" + token.pos_ for token in doc])
    for i, pattern in enumerate(re_patterns):
        # re.findall returns non-overlapping matches only
        for result in re.findall(pattern, text_pos):
            matches.append([" ".join(result), patterns[i]])
    return matches
# Test it
print(extract(nlp, "A sleeping cat and walking dog", patterns, re_patterns))
# Returns
# [['sleeping cat', 'VERB NOUN'], ['walking dog', 'VERB NOUN']]
# Extract the matched patterns
df['matches'] = df['Text'].apply(lambda x: extract(nlp,x,patterns,re_patterns))
# Convert the list of tuples into rows
df = df.matches.apply(pd.Series).merge(df, right_index=True, left_index=True).drop(["matches"], axis=1)\
    .melt(id_vars=['Sent_id', 'Text'], value_name="matches").drop("variable", axis=1)
# Add the matched text and matched patterns into new columns
df[['matched_text','matched_pattern']]= df.matches.apply(pd.Series)
# Drop the column and cleanup
df = df.drop("matches", axis = 1).sort_values('Sent_id')
df = df.drop_duplicates(subset =["matched_text", "matched_pattern"], keep='last')
   Sent_id  Text                                       matched_text           matched_pattern
0  1        I am exploring text analytics using spacy  exploring text         VERB NOUN
2  1        I am exploring text analytics using spacy  using spacy            VERB NOUN
4  1        I am exploring text analytics using spacy  analytics using spacy  NOUN VERB NOUN
1  2        amazing spacy is going to help me          NaN                    NaN
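Note that re.findall returns only non-overlapping matches, and the pattern list used above does not include the VERB NOUN NOUN and ADJ NOUN patterns from the question, which is why sentence 2 ends up with NaN. As a rough alternative, the sketch below slides a window over (word, tag) pairs so that overlapping phrases are all found, and emits one row per matched phrase. The helper names match_pos_ngrams and expand_rows are hypothetical, and the tags are written by hand for illustration rather than produced by spaCy; a real model may tag these words differently.

```python
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS tag) pairs for one sentence


def match_pos_ngrams(tagged: Tagged, patterns: List[str]) -> List[Tuple[str, str]]:
    """Return (phrase, pattern) for every window whose tags equal a pattern."""
    matches = []
    tags = [pos for _, pos in tagged]
    for pattern in patterns:
        wanted = pattern.split()
        n = len(wanted)
        # Slide a window of length n over the sentence; windows may overlap.
        for start in range(len(tagged) - n + 1):
            if tags[start:start + n] == wanted:
                phrase = " ".join(word for word, _ in tagged[start:start + n])
                matches.append((phrase, pattern))
    return matches


def expand_rows(sentences, patterns: List[str]) -> List[Dict[str, object]]:
    """Duplicate each sentence record once per matched phrase."""
    rows = []
    for sent_id, text, tagged in sentences:
        for phrase, pattern in match_pos_ngrams(tagged, patterns):
            rows.append({"Sent_id": sent_id, "Text": text,
                         "Feature": phrase, "Pattern": pattern})
    return rows


# Hand-written tags (an assumption for illustration, not real spaCy output).
sentences = [
    (1, "I am exploring text analytics using spacy",
     [("I", "PRON"), ("am", "AUX"), ("exploring", "VERB"), ("text", "NOUN"),
      ("analytics", "NOUN"), ("using", "VERB"), ("spacy", "NOUN")]),
    (2, "amazing spacy is going to help me",
     [("amazing", "ADJ"), ("spacy", "NOUN"), ("is", "AUX"), ("going", "VERB"),
      ("to", "PART"), ("help", "VERB"), ("me", "PRON")]),
]
patterns = ["VERB NOUN NOUN", "NOUN VERB NOUN", "ADJ NOUN"]

rows = expand_rows(sentences, patterns)
for row in rows:
    print(row)
```

With real data, the (word, tag) pairs could be built from [(token.text, token.pos_) for token in nlp(text)], and the resulting list of dicts can be passed to pd.DataFrame(rows) to get back a dataframe with one row per matched phrase.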