I would like to extract nouns and compound nouns, including hyphenated ones, from each sentence, as shown below. If a noun contains a hyphen, I need to extract it with the hyphen.
{The T-shirt is old.: ['T-shirt'],
I bought the computer and the new web-cam.: ['computer', 'web-cam'],
I bought the computer and the new web camera.: ['computer', 'web camera']}
The current output is below. The first word of each compound noun is labeled 'compound', but so far I cannot extract what I expect.
T T PROPN NNP compound X True False
shirt shirt NOUN NN nsubj xxxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
cam cam NOUN NN conj xxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
camera camera NOUN NN conj xxxx True False
{The T-shirt is old.: ['T -', 'T', 'T -', 'shirt'],
I bought the computer and the new web-cam.: ['web -', 'computer', 'web -', 'web', 'web -', 'cam'],
I bought the computer and the new web camera.: ['web camera', 'computer', 'web camera', 'web', 'web camera', 'camera']}
I'm using the NLP library spaCy to distinguish nouns from compound nouns. I would appreciate any advice on how to fix the current code.
import spacy
nlp = spacy.load("en_core_web_sm")
texts = ["The T-shirt is old.", "I bought the computer and the new web-cam.", "I bought the computer and the new web camera."]
nouns = []*len(texts)
dic = {k: v for k, v in zip(texts, nouns)}
for i in range(len(texts)):
    text = nlp(texts[i])
    words = []
    for word in text:
        if word.pos_ == 'NOUN' or word.pos_ == 'PROPN':
            print(word.text, word.lemma_, word.pos_, word.tag_, word.dep_,
                  word.shape_, word.is_alpha, word.is_stop)
            #compound words
            for j in range(len(text)):
                token = text[j]
                if token.dep_ == 'compound':
                    if j < len(text)-1:
                        nexttoken = text[j+1]
                        words.append(str(token.text + ' ' + nexttoken.text))
            else:
                words.append(word.text)
    dic[text] = words
print(dic)
Python 3.7.4
spaCy version 2.3.2
Please try:
import spacy
nlp = spacy.load("en_core_web_sm")
texts = ("The T-shirt is old.",
         "I bought the computer and the new web-cam.",
         "I bought the computer and the new web camera.",
         )
docs = nlp.pipe(texts)
compounds = []
for doc in docs:
    # Slice from each compound modifier through its head; each result is a spaCy Span.
    compounds.append({doc.text: [doc[tok.i:tok.head.i+1] for tok in doc if tok.dep_ == "compound"]})
print(compounds)
[{'The T-shirt is old.': [T-shirt]},
{'I bought the computer and the new web-cam.': [web-cam]},
{'I bought the computer and the new web camera.': [web camera]}]
computer is missing from this list because only tokens labeled compound are collected, but I do not think it qualifies as a compound anyway.
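If you also want plain nouns such as computer in the same list, one possible approach is to merge each compound modifier with its head and then add any remaining NOUN or PROPN tokens. The following is only a sketch: `extract_nouns` is a name I made up, and it assumes the parser labels the tokens as in the question's output. It only reads the token attributes `i`, `head`, `dep_`, `pos_`, `text` and `text_with_ws`, so it does not import spaCy itself.

```python
def extract_nouns(doc):
    """Return nouns in document order, joining compound modifiers with their head.

    `doc` is any sequence of spaCy-like tokens (hypothetical helper, not a spaCy API).
    """
    # One (start, end) token-index range per compound modifier and its head.
    spans = sorted(
        (min(tok.i, tok.head.i), max(tok.i, tok.head.i))
        for tok in doc
        if tok.dep_ == "compound"
    )

    # Merge overlapping ranges so multi-word compounds become a single phrase.
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    tokens = list(doc)
    end_by_start = dict(merged)
    results = []
    i = 0
    while i < len(tokens):
        if i in end_by_start:
            end = end_by_start[i]
            # text_with_ws keeps the original spacing, so "T-shirt" keeps its hyphen
            # and "web camera" keeps its space.
            phrase = "".join(t.text_with_ws for t in tokens[i:end + 1]).strip()
            results.append(phrase)
            i = end + 1
        else:
            # Standalone nouns that are not part of any compound range.
            if tokens[i].pos_ in ("NOUN", "PROPN"):
                results.append(tokens[i].text)
            i += 1
    return results
```

With spaCy loaded as above, you would call it as `{doc.text: extract_nouns(doc) for doc in nlp.pipe(texts)}`. The result depends on how the model tags and parses each sentence, so a different model or version may give different lists.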