 

Trouble extracting compound nouns, including hyphenated ones, in NLP

Background and Goal

I would like to extract nouns and compound nouns from each sentence, as shown below. If a compound noun contains a hyphen, it should be extracted with the hyphen intact.

{The T-shirt is old.: ['T-shirt'], 
I bought the computer and the new web-cam.: ['computer', 'web-cam'], 
I bought the computer and the new web camera.: ['computer', 'web camera']}

Problem

The current output is below. The first word of each compound noun is labeled 'compound', but I cannot yet extract the results I expect.

T T PROPN NNP compound X True False
shirt shirt NOUN NN nsubj xxxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
cam cam NOUN NN conj xxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
camera camera NOUN NN conj xxxx True False

{The T-shirt is old.: ['T -', 'T', 'T -', 'shirt'], 
I bought the computer and the new web-cam.: ['web -', 'computer', 'web -', 'web', 'web -', 'cam'], 
I bought the computer and the new web camera.: ['web camera', 'computer', 'web camera', 'web', 'web camera', 'camera']}

Current Code

I'm using the NLP library spaCy to distinguish nouns from compound nouns. I would appreciate advice on how to fix the current code.

import spacy
nlp = spacy.load("en_core_web_sm")

texts =  ["The T-shirt is old.", "I bought the computer and the new web-cam.", "I bought the computer and the new web camera."]

nouns = []*len(texts)
dic = {k: v for k, v in zip(texts, nouns)}

for i in range(len(texts)):
    text = nlp(texts[i])
    words = []
    for word in text:
        if word.pos_ == 'NOUN' or word.pos_ == 'PROPN':
            print(word.text, word.lemma_, word.pos_, word.tag_, word.dep_,
                word.shape_, word.is_alpha, word.is_stop)

            #compound words
            for j in range(len(text)):
                    token = text[j]
                    if token.dep_ == 'compound':
                        if j < len(text)-1:
                            nexttoken = text[j+1]
                            words.append(str(token.text + ' ' + nexttoken.text))


            else:
                words.append(word.text)
    dic[text] = words       
print(dic)

Development Environment

Python 3.7.4

spaCy version 2.3.2


1 Answer

Please try:

import spacy
nlp = spacy.load("en_core_web_sm")

texts = ("The T-shirt is old.",
         "I bought the computer and the new web-cam.",
         "I bought the computer and the new web camera.",
        )
docs = nlp.pipe(texts)

compounds = []
for doc in docs:
    compounds.append({doc.text:[doc[tok.i:tok.head.i+1] for tok in doc if tok.dep_=="compound"]})
print(compounds)

Output:

[{'The T-shirt is old.': [T-shirt]},
{'I bought the computer and the new web-cam.': [web-cam]},
{'I bought the computer and the new web camera.': [web camera]}]

Note that computer is missing from this list: it is a single noun, not a compound, so none of its tokens carries the compound dependency label.
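
If standalone nouns such as computer are wanted as well, one option is to start from every noun that is not itself a compound modifier and extend it leftwards to its compound modifier, if it has one. The following is only a minimal sketch of that idea. To stay runnable without a model download, it uses a hypothetical Tok stand-in that mirrors the spaCy token attributes i, text, whitespace_, pos_ and dep_, plus a head_i field holding the index of the head token. The annotations for the noun tokens follow the printed output in the question; those for the other tokens, and all head indices, are hand-written assumptions.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for the spaCy token attributes used below."""
    i: int
    text: str
    whitespace_: str
    pos_: str
    dep_: str
    head_i: int  # index of the head token (tok.head.i on a real spaCy token)


def nouns_and_compounds(tokens):
    """Return each noun once, joined with its compound modifier if it has one."""
    # Index of the left-most "compound" dependent of each head token.
    start = {}
    for tok in tokens:
        if tok.dep_ == "compound":
            start[tok.head_i] = min(tok.i, start.get(tok.head_i, tok.i))

    results = []
    for tok in tokens:
        # Skip compound modifiers; they are emitted together with their head.
        if tok.pos_ in ("NOUN", "PROPN") and tok.dep_ != "compound":
            first = start.get(tok.i, tok.i)
            span = tokens[first:tok.i + 1]
            # Rebuild the original text, so "web", "-", "cam" become "web-cam".
            results.append("".join(t.text + t.whitespace_ for t in span).strip())
    return results


# "I bought the computer and the new web-cam." with assumed annotations.
tokens = [
    Tok(0, "I", " ", "PRON", "nsubj", 1),
    Tok(1, "bought", " ", "VERB", "ROOT", 1),
    Tok(2, "the", " ", "DET", "det", 3),
    Tok(3, "computer", " ", "NOUN", "dobj", 1),
    Tok(4, "and", " ", "CCONJ", "cc", 3),
    Tok(5, "the", " ", "DET", "det", 9),
    Tok(6, "new", " ", "ADJ", "amod", 9),
    Tok(7, "web", "", "NOUN", "compound", 9),
    Tok(8, "-", "", "PUNCT", "punct", 9),
    Tok(9, "cam", "", "NOUN", "conj", 3),
    Tok(10, ".", "", "PUNCT", "punct", 1),
]

print(nouns_and_compounds(tokens))  # ['computer', 'web-cam']
```

With a real spaCy Doc, the same loop should carry over by reading tok.head.i in place of head_i and slicing the Doc instead of a list. Chains of several compound modifiers (a three-word compound, for example) are not handled by this sketch.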

Answered by Sergey Bushmanov, Oct 20 '25 06:10