
POS pattern mining with spacy

I am trying to extract linguistic features from text using spaCy in Python 3. My input looks like this:

Sent_id  Text
1        I am exploring text analytics using spacy
2        amazing spacy is going to help me

I am looking for output like the table below: words extracted as bigram or trigram phrases that match particular POS patterns supplied by me, such as NOUN VERB NOUN or ADJ NOUN, while retaining the dataframe structure. If a sentence yields multiple phrases, its record has to be duplicated, one row per phrase.

Sent_id  Text                                       Feature                   Pattern
1        I am exploring text analytics using spacy  exploring text analytics  VERB NOUN NOUN
1        I am exploring text analytics using spacy  analytics using spacy     NOUN VERB NOUN
2        amazing spacy is going to help me          amazing spacy             ADJ NOUN
Basudev asked Mar 28 '19 08:03


1 Answer

The code is explained in the comments.

import spacy
import pandas as pd
import re

# Load the spaCy model once and reuse it
nlp = spacy.load('en_core_web_sm')

# The dataframe with text
df = pd.DataFrame({
    'Sent_id': [1, 2],
    'Text': ["I am exploring text analytics using spacy", "amazing spacy is going to help me"]
})

# Patterns we are interested in
patterns = ["VERB NOUN", "NOUN VERB NOUN"]

# Convert each pattern into a regular expression; every token is encoded as word_!POS
re_patterns = [" ".join([r"(\w+)_!" + pos for pos in p.split()]) for p in patterns]

def extract(nlp, text, patterns, re_patterns):
    """Extracts the pieces of text matching the POS patterns in patterns

    nlp : Loaded spaCy model object
    text: The input text
    patterns: The list of patterns to be searched
    re_patterns: The patterns converted into regex

    returns: A list of [t, p] pairs where
    t is the part of text matching the pattern p in patterns
    """
    doc = nlp(text)
    matches = list()
    text_pos = " ".join([token.text+"_!"+token.pos_ for token in doc])
    for i, pattern in enumerate(re_patterns):
        for result in re.findall(pattern, text_pos):
            matches.append([" ".join(result), patterns[i]])
    return matches

# Test it 
print(extract(nlp, "A sleeping cat and walking dog", patterns, re_patterns))
# Returns
# [['sleeping cat', 'VERB NOUN'], ['walking dog', 'VERB NOUN']]

# Extract the matched patterns
df['matches'] = df['Text'].apply(lambda x: extract(nlp,x,patterns,re_patterns))

# Convert the list of tuples into rows
df = df.matches.apply(pd.Series).merge(df, right_index = True, left_index = True).drop(["matches"], axis = 1)\
.melt(id_vars = ['Sent_id', 'Text'], value_name = "matches").drop("variable", axis = 1)

# Add the matched text and matched patterns into new columns
df[['matched_text','matched_pattern']]= df.matches.apply(pd.Series)

# Drop the column and cleanup
df = df.drop("matches", axis = 1).sort_values('Sent_id')
df = df.drop_duplicates(subset =["matched_text", "matched_pattern"], keep='last') 


   Sent_id  Text                                       matched_text           matched_pattern
0  1        I am exploring text analytics using spacy  exploring text         VERB NOUN
2  1        I am exploring text analytics using spacy  using spacy            VERB NOUN
4  1        I am exploring text analytics using spacy  analytics using spacy  NOUN VERB NOUN
1  2        amazing spacy is going to help me          NaN                    NaN
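
One limitation worth noting: re.findall returns non-overlapping matches, so a pattern such as NOUN NOUN would find only "text analytics" in "text analytics software" and skip "analytics software". A minimal sketch of an alternative, not part of the code above, compares a sliding window of tags against each pattern and emits one record per match, which also gives the duplicated rows directly. The helper names match_pos_patterns and expand_rows are hypothetical, and the tags in the usage example are hand-written for illustration rather than produced by spaCy; with spaCy the pairs could presumably be built as [(token.text, token.pos_) for token in nlp(text)].

```python
from typing import Dict, List, Sequence, Tuple

# A sentence as (word, POS) pairs. Hypothetical input type for this sketch.
Tagged = Sequence[Tuple[str, str]]


def match_pos_patterns(tagged: Tagged, patterns: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (phrase, pattern) for every token window whose tags equal a pattern.

    Windows may overlap, unlike matches returned by re.findall.
    """
    words = [word for word, _ in tagged]
    tags = [pos for _, pos in tagged]
    found = []
    for pattern in patterns:
        wanted = pattern.split()
        if not wanted:
            continue  # ignore empty patterns
        n = len(wanted)
        # Slide a window of n tokens across the sentence
        for start in range(len(tags) - n + 1):
            if tags[start:start + n] == wanted:
                found.append((" ".join(words[start:start + n]), pattern))
    return found


def expand_rows(rows, patterns: Sequence[str]) -> List[Dict[str, object]]:
    """Build one record per (sentence, match); sentences without a match are skipped.

    rows is a sequence of (sent_id, text, tagged) triples.
    """
    out = []
    for sent_id, text, tagged in rows:
        for phrase, pattern in match_pos_patterns(tagged, patterns):
            out.append({"Sent_id": sent_id, "Text": text,
                        "Feature": phrase, "Pattern": pattern})
    return out


if __name__ == "__main__":
    # Hand-written tags for illustration only, NOT real spaCy output
    tagged = [("text", "NOUN"), ("analytics", "NOUN"), ("software", "NOUN")]
    print(match_pos_patterns(tagged, ["NOUN NOUN"]))
    # [('text analytics', 'NOUN NOUN'), ('analytics software', 'NOUN NOUN')]
```

The list of dicts returned by expand_rows can be passed to pd.DataFrame to get back a dataframe with the Sent_id, Text, Feature and Pattern columns.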
mujjiga answered Oct 14 '22 23:10
