POS pattern mining with spacy

Tags:

spacy

I am trying to extract linguistic features from text using spaCy in Python 3. My input looks like this:

Sent_id Text
1   I am exploring text analytics using spacy
2   amazing spacy is going to help me

I am looking for output like the table below: words extracted as bigram/trigram phrases that match a POS pattern I supply (for example NOUN VERB NOUN or ADJ NOUN), with the dataframe structure retained. If one sentence yields multiple phrases, its record has to be duplicated, one row per phrase.

Sent_id  Text                                       Feature                   Pattern
1        I am exploring text analytics using spacy  exploring text analytics  VERB NOUN NOUN
1        I am exploring text analytics using spacy  analytics using spacy     NOUN VERB NOUN
2        amazing spacy is going to help me          amazing spacy             ADJ NOUN


asked Mar 28 '19 08:03

Basudev

1 Answer

Code is explained in the comments

import spacy
import pandas as pd
import re

# Load spacy model once and reuse 
nlp = spacy.load('en_core_web_sm')

# The dataframe with text
df = pd.DataFrame({
        'Sent_id': [1,2],
        'Text': [ "I am exploring text analytics using spacy", "amazing spacy is going to help me"]
    }) 

# Patterns we are interested in
patterns = ["VERB NOUN", "NOUN VERB NOUN"]

# Convert each pattern into a regular expression over "word_!POS" tokens
# (raw string so that \w is not treated as a string escape)
re_patterns = [" ".join([r"(\w+)_!" + pos for pos in p.split()]) for p in patterns]


def extract(nlp, text, patterns, re_patterns):
    """Extract the pieces of text matching the POS patterns in patterns.

    args:
        nlp: Loaded spaCy model object
        text: The input text
        patterns: The list of patterns to be searched
        re_patterns: The patterns converted into regexes

    returns: A list of [t, p] pairs where
    t is the part of text matching the pattern p in patterns
    """
    doc = nlp(text)   
    matches = list()
    text_pos = " ".join([token.text+"_!"+token.pos_ for token in doc])
    for i, pattern in enumerate(re_patterns):
        for result in re.findall(pattern, text_pos):
            matches.append([" ".join(result), patterns[i]])
    return matches

# Test it
print(extract(nlp, "A sleeping cat and walking dog", patterns, re_patterns))
# Returns
# [['sleeping cat', 'VERB NOUN'], ['walking dog', 'VERB NOUN']]

# Extract the matched patterns
df['matches'] = df['Text'].apply(lambda x: extract(nlp,x,patterns,re_patterns))


# Spread each list of matches over columns, then melt so every match gets its own row
df = df.matches.apply(pd.Series).merge(df, right_index = True, left_index = True).drop(["matches"], axis = 1)\
.melt(id_vars = ['Sent_id', 'Text'], value_name = "matches").drop("variable", axis = 1)

# Add the matched text and matched patterns into new columns
df[['matched_text','matched_pattern']]= df.matches.apply(pd.Series)

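# Drop the helper column and clean up.
# drop_duplicates removes the repeated NaN rows that melt produces for sentences
# with fewer matches than the maximum. Note that it also collapses identical
# (matched_text, matched_pattern) pairs that come from different sentences.
df = df.drop("matches", axis = 1).sort_values('Sent_id')
df = df.drop_duplicates(subset =["matched_text", "matched_pattern"], keep='last')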

Output:

    Sent_id  Text                                       matched_text           matched_pattern
0   1        I am exploring text analytics using spacy  exploring text         VERB NOUN
2   1        I am exploring text analytics using spacy  using spacy            VERB NOUN
4   1        I am exploring text analytics using spacy  analytics using spacy  NOUN VERB NOUN
1   2        amazing spacy is going to help me          NaN                    NaN
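
One caveat with the regex approach: re.findall only returns non-overlapping matches, so a pattern such as NOUN NOUN applied to three consecutive nouns yields a single phrase. A possible alternative, sketched below, is to slide a window over (word, POS) pairs instead. This is only an illustration, not part of the code above: the names find_pos_ngrams and expand_rows are made up for this example, and the tags in sample are assigned by hand so the snippet runs without a spaCy model. With spaCy the pairs would presumably come from [(token.text, token.pos_) for token in nlp(text)].

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Tagged = Tuple[str, str]  # (token text, coarse POS tag), e.g. ("cat", "NOUN")


def find_pos_ngrams(tagged: Sequence[Tagged], patterns: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (phrase, pattern) pairs for every window whose tags equal a pattern."""
    matches = []
    for pattern in patterns:
        wanted = pattern.split()
        n = len(wanted)
        # Slide a window of length n one token at a time, so overlapping
        # matches of the same pattern are all reported.
        for start in range(len(tagged) - n + 1):
            window = tagged[start:start + n]
            if [pos for _, pos in window] == wanted:
                matches.append((" ".join(word for word, _ in window), pattern))
    return matches


def expand_rows(rows: Iterable[Tuple[int, Sequence[Tagged]]], patterns: Sequence[str]) -> List[Dict]:
    """Emit one record per matched phrase; sentences without a match emit nothing."""
    out = []
    for sent_id, tagged in rows:
        text = " ".join(word for word, _ in tagged)
        for phrase, pattern in find_pos_ngrams(tagged, patterns):
            out.append({"Sent_id": sent_id, "Text": text, "Feature": phrase, "Pattern": pattern})
    return out


# Hand-tagged sample (an assumption for illustration, not spaCy output).
sample = [("text", "NOUN"), ("analytics", "NOUN"), ("tools", "NOUN"),
          ("help", "VERB"), ("people", "NOUN")]
result = find_pos_ngrams(sample, ["NOUN NOUN", "NOUN VERB NOUN"])
print(result)
# [('text analytics', 'NOUN NOUN'), ('analytics tools', 'NOUN NOUN'),
#  ('tools help people', 'NOUN VERB NOUN')]
```

The list of records returned by expand_rows can be passed to pd.DataFrame to get back a dataframe with one row per phrase. Unlike the answer's output, sentences with no match are dropped here rather than kept as a NaN row.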


answered Oct 14 '22 23:10

mujjiga
