Trouble extracting compound nouns including hyphens in NLP

Question

Background and Goal

I would like to extract nouns and compound nouns from each sentence, as shown below. If a compound noun contains a hyphen, it should be extracted with the hyphen intact.

{The T-shirt is old.: ['T-shirt'], 
I bought the computer and the new web-cam.: ['computer', 'web-cam'], 
I bought the computer and the new web camera.: ['computer', 'web camera']}

Problem

The current output is below. The first word of each compound noun is labeled 'compound', but I cannot get the result I expect.

T T PROPN NNP compound X True False
shirt shirt NOUN NN nsubj xxxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
cam cam NOUN NN conj xxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
camera camera NOUN NN conj xxxx True False

{The T-shirt is old.: ['T -', 'T', 'T -', 'shirt'], 
I bought the computer and the new web-cam.: ['web -', 'computer', 'web -', 'web', 'web -', 'cam'], 
I bought the computer and the new web camera.: ['web camera', 'computer', 'web camera', 'web', 'web camera', 'camera']}

Current Code

I'm using the NLP library spaCy to distinguish nouns from compound nouns. I would appreciate advice on how to fix the current code.

import spacy
nlp = spacy.load("en_core_web_sm")

texts =  ["The T-shirt is old.", "I bought the computer and the new web-cam.", "I bought the computer and the new web camera."]

nouns = []*len(texts)
dic = {k: v for k, v in zip(texts, nouns)}

for i in range(len(texts)):
    text = nlp(texts[i])
    words = []
    for word in text:
        if word.pos_ == 'NOUN'or word.pos_ == 'PROPN':
            print(word.text, word.lemma_, word.pos_, word.tag_, word.dep_,
                word.shape_, word.is_alpha, word.is_stop)

            #compound words
            for j in range(len(text)):
                    token = text[j]
                    if token.dep_ == 'compound':
                        if j < len(text)-1:
                            nexttoken = text[j+1]
                            words.append(str(token.text + ' ' + nexttoken.text))


            else:
                words.append(word.text)
    dic[text] = words       
print(dic)

Development Environment

Python 3.7.4

spaCy version 2.3.2

Sergey Bushmanov · Accepted Answer

Please try:

import spacy
nlp = spacy.load("en_core_web_sm")

texts = ("The T-shirt is old.",
         "I bought the computer and the new web-cam.",
         "I bought the computer and the new web camera.",
        )
docs = nlp.pipe(texts)

compounds = []
for doc in docs:
    # A token labeled "compound" points to its head noun, so slicing from
    # the token to its head yields the whole compound, hyphen included.
    spans = [doc[tok.i:tok.head.i + 1] for tok in doc if tok.dep_ == "compound"]
    compounds.append({doc.text: spans})
print(compounds)

Output:

[{'The T-shirt is old.': [T-shirt]}, 
{'I bought the computer and the new web-cam.': [web-cam]}, 
{'I bought the computer and the new web camera.': [web camera]}]

"computer" is missing from this list because it is a single noun, and I do not think it qualifies as a compound.
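
If the standalone nouns are wanted alongside the compounds, as in the expected output of the question, one possible approach is to walk the tokens once, emit a whole span for each 'compound' token, and skip the tokens that span already covers. The following is only a sketch: `extract_nouns` and `make_doc` are hypothetical helper names, and the stand-in tokens are built by hand from the parse printed in the question so that the snippet runs without spaCy installed. The function only reads `i`, `text`, `pos_`, `dep_`, `head` and `whitespace_`, which spaCy tokens also provide, so passing `nlp(text)` instead should work in principle, but that has not been verified here and depends on the parser assigning the same labels.

```python
from types import SimpleNamespace


def extract_nouns(doc):
    """Return nouns and compound nouns in document order, hyphens preserved."""
    found = []
    covered = set()  # indices of tokens already emitted as part of a compound
    for tok in doc:
        if tok.i in covered:
            continue
        if tok.dep_ == "compound" and tok.head.i > tok.i:
            end = tok.head.i
            # Follow chains where the head is itself a compound modifier.
            while doc[end].dep_ == "compound" and doc[end].head.i > end:
                end = doc[end].head.i
            # Rejoin the tokens with their original trailing whitespace.
            pieces = [doc[k].text + doc[k].whitespace_ for k in range(tok.i, end + 1)]
            found.append("".join(pieces).strip())
            covered.update(range(tok.i, end + 1))
        elif tok.pos_ in ("NOUN", "PROPN"):
            found.append(tok.text)
    return found


def make_doc(rows):
    """Build stand-in tokens from (text, pos, dep, head_index, whitespace) rows."""
    toks = [SimpleNamespace(i=i, text=t, pos_=p, dep_=d, whitespace_=w)
            for i, (t, p, d, _, w) in enumerate(rows)]
    for tok, row in zip(toks, rows):
        tok.head = toks[row[3]]
    return toks


# Hand-written parse of "I bought the computer and the new web-cam.",
# following the labels printed in the question (an assumption, not parser output).
sample = make_doc([
    ("I", "PRON", "nsubj", 1, " "),
    ("bought", "VERB", "ROOT", 1, " "),
    ("the", "DET", "det", 3, " "),
    ("computer", "NOUN", "dobj", 1, " "),
    ("and", "CCONJ", "cc", 3, " "),
    ("the", "DET", "det", 9, " "),
    ("new", "ADJ", "amod", 9, " "),
    ("web", "NOUN", "compound", 9, ""),
    ("-", "PUNCT", "punct", 9, ""),
    ("cam", "NOUN", "conj", 3, ""),
    (".", "PUNCT", "punct", 1, ""),
])
print(extract_nouns(sample))  # ['computer', 'web-cam']
```

Tracking the covered indices is what prevents "web" and "cam" from being listed a second time as individual nouns, which is the duplication visible in the output of the question.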

Tags: python, string, python-3.x, nlp, spacy
