I am new to spaCy and to NLP overall.
To understand how spaCy works, I would like to create a function that takes a sentence and returns a dictionary, tuple, or list with each noun and the words describing it.
I know that spaCy builds a dependency tree of the sentence and knows the role of each word (shown in displaCy).
But what's the right way to get from:
"A large room with two yellow dishwashers in it"
To:
{noun: "room", adj: "large"} {noun: "dishwasher", adj: "yellow", adv: "two"}
Or any other solution that gives me all related words in a usable bundle.
Thanks in advance!
This is a very straightforward use of the DependencyMatcher.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

pattern = [
    # anchor token: a noun
    {
        "RIGHT_ID": "target",
        "RIGHT_ATTRS": {"POS": "NOUN"},
    },
    # modifier: a direct dependent of the noun via amod (adjective) or nummod (numeral)
    {
        "LEFT_ID": "target",
        "REL_OP": ">",
        "RIGHT_ID": "modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "nummod"]}},
    },
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("MODIFIERS", [pattern])

text = "A large room with two yellow dishwashers in it"
doc = nlp(text)

# each match gives token indices in pattern order: (target, modifier)
for match_id, (target, modifier) in matcher(doc):
    print(doc[modifier], doc[target], sep="\t")
Output:
large room
two dishwashers
yellow dishwashers
It should be easy to turn that into a dictionary or whatever you'd like. You might also want to modify it to take proper nouns as the target, or to support other kinds of dependency relations, but this should be a good start.
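For instance, grouping the matcher's (modifier, target) pairs by noun could look like the following minimal sketch. Plain strings stand in for the spaCy tokens so it runs on its own; the `pairs` list mirrors what the loop above prints:

```python
from collections import defaultdict

# (modifier, target) pairs as the DependencyMatcher loop above would produce;
# plain strings stand in for spaCy tokens to keep the sketch self-contained
pairs = [("large", "room"), ("two", "dishwashers"), ("yellow", "dishwashers")]

# group all modifiers under their noun
bundles = defaultdict(list)
for modifier, target in pairs:
    bundles[target].append(modifier)

result = [{"noun": noun, "modifiers": mods} for noun, mods in bundles.items()]
print(result)
# [{'noun': 'room', 'modifiers': ['large']},
#  {'noun': 'dishwashers', 'modifiers': ['two', 'yellow']}]
```

In the real pipeline you would append `doc[modifier]` and key on `doc[target]` (or its text) instead of the placeholder strings.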
You may also want to look at the noun chunks feature.
What you want to do is called "noun chunks":
import spacy

nlp = spacy.load('en_core_web_md')
txt = "A large room with two yellow dishwashers in it"
doc = nlp(txt)

chunks = []
for chunk in doc.noun_chunks:
    out = {}
    root = chunk.root
    out[root.pos_] = root
    for tok in chunk:
        if tok != root:
            out[tok.pos_] = tok
    chunks.append(out)

print(chunks)
[
{'NOUN': room, 'DET': A, 'ADJ': large},
{'NOUN': dishwashers, 'NUM': two, 'ADJ': yellow},
{'PRON': it}
]
You may notice a "noun chunk" doesn't guarantee its root will always be a noun ("it" above is a pronoun). Should you wish to restrict your results to nouns only:
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    noun = chunk.root
    if noun.pos_ != 'NOUN':
        continue
    out['noun'] = noun
    for tok in chunk:
        if tok != noun:
            out[tok.pos_] = tok
    chunks.append(out)

print(chunks)
[
{'noun': room, 'DET': A, 'ADJ': large},
{'noun': dishwashers, 'NUM': two, 'ADJ': yellow}
]