Here is a simple dataset:
import pandas as pd
product = ['knife', 'box set', 'beautiful jewellery set on sale', 'green']
df = pd.DataFrame(product, columns = ['product_name'])
df
The output is as follows:
| | product_name |
|---|---|
| 0 | knife |
| 1 | box set |
| 2 | beautiful jewellery set on sale |
| 3 | green |
I want to categorise these products by extracting two consecutive nouns where they occur. So far I have the following, but the category is reduced to a single noun in every case:
```python
!pip install -q --upgrade spacy
import spacy

nlp = spacy.load('en_core_web_sm')

category = []
for i in df['product_name'].tolist():
    doc = nlp(i)
    for t in doc:
        if t.pos_ in ['NOUN']:
            category.append(f'{t}')
            break
    if t.pos_ not in ['NOUN']:
        category.append('NaN')

df1 = pd.DataFrame(category, columns=['product_category'])
df1
```
The output I have:
| | product_category |
|---|---|
| 0 | knife |
| 1 | set |
| 2 | jewellery |
| 3 | NaN |
The expected output:
| | product_category |
|---|---|
| 0 | knife |
| 1 | box set |
| 2 | jewellery set |
| 3 | NaN |
Is it possible to add a condition to the code so that two nouns are extracted when they follow one another?
You can use spaCy's `Matcher`:
```python
import spacy
import pandas as pd
import numpy as np

product = ['knife', 'box set', 'beautiful jewellery set on sale', 'green']
df = pd.DataFrame(product, columns=['product_name'])

nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)
pattern = [{'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]
matcher.add('NOUN_PATTERN', [pattern])

def get_two_nouns(x):
    doc = nlp(x)
    results = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        results.append(span.text)
    return max(results, key=lambda x: len(x.split()), default=np.nan)

df['product_name'].apply(get_two_nouns)
```
Output:

```
0            knife
1          box set
2    jewellery set
3              NaN
Name: product_name, dtype: object
```
The `pattern = [{'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]` pattern matches one noun optionally followed by a second noun: the second token is optional because its `OP` operator is set to `?`. Because that token is optional, the `Matcher` returns overlapping one- and two-noun spans for the same text.
The `return max(results, key=lambda x: len(x.split()), default=np.nan)` part returns the longest match (length measured in whitespace-separated token count here), and `default=np.nan` yields `NaN` when no noun is found at all.
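To see why the `max(...)` step is needed, here is a minimal sketch of the overlapping matches the `Matcher` produces for the third product. To keep it runnable without downloading a model, the `Doc` is built with hand-assigned POS tags (the same tags `en_core_web_sm` would assign):

```python
import spacy
from spacy.tokens import Doc
from spacy.matcher import Matcher

nlp = spacy.blank('en')

# Hand-tagged Doc so the sketch runs without a trained pipeline;
# en_core_web_sm would produce these POS tags itself.
words = ['beautiful', 'jewellery', 'set', 'on', 'sale']
pos = ['ADJ', 'NOUN', 'NOUN', 'ADP', 'NOUN']
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
matcher.add('NOUN_PATTERN', [[{'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]])

# Overlapping spans: 'jewellery', 'jewellery set', 'set', 'sale'
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)

# Picking the longest span recovers the two-noun category
print(max(spans, key=lambda s: len(s.split())))
```

Without the `max(...)` selection you would get whichever match happens to come first, which is why the single-noun version of the code returned `jewellery` instead of `jewellery set`.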