I am new to spaCy and I would like to extract "all" the noun phrases from a sentence. How can I do that? I have the following code:
import spacy
nlp = spacy.load("en")
file = open("E:/test.txt", "r")
doc = nlp(file.read())
for np in doc.noun_chunks:
    print(np.text)
But it returns only the base noun phrases, that is, phrases that don't have any other NP in them. For example, for the following sentence, I get the result below:
Phrase: We try to explicitly describe the geometry of the edges of the images.
Result: We, the geometry, the edges, the images.
Expected result: We, the geometry, the edges, the images, the geometry of the edges of the images, the edges of the images.
How can I get all the noun phrases, including nested phrases?
Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world's largest tech fund”. To get the noun chunks in a document, simply iterate over Doc.noun_chunks.
TextBlob offers something similar: the noun_phrases property of a TextBlob object returns the noun phrases of a text.
Noun chunking is a core feature of natural language processing. Noun chunks are known as "noun phrases" in linguistics. Basically, they are nouns together with all the words that depend on them. For example, take the following sentence: John Doe has been working for the Microsoft company in Seattle since 1999.
Please see the commented code below, which recursively combines the nouns. The code is inspired by the spaCy docs.
import spacy

# "en_core_web_sm" replaces the "en" shortcut, which spaCy v3 no longer accepts
nlp = spacy.load("en_core_web_sm")
text = "We try to explicitly describe the geometry of the edges of the images."
doc = nlp(text)
for np in doc.noun_chunks:  # use np instead of np.text
    print(np)
print()

# code to recursively combine nouns
# 'We' is actually a pronoun but included in your question
# hence the token.pos_ == "PRON" part in the last if statement
# suggest you extract PRON separately like the noun-chunks above
nounIndices = []
for token in doc:
    # print(token.text, token.pos_, token.dep_, token.head.text)
    if token.pos_ == 'NOUN':
        nounIndices.append(token.i)
print(nounIndices)

for idxValue in nounIndices:
    # re-parse so that each merge starts from the original tokenization
    doc = nlp(text)
    span = doc[doc[idxValue].left_edge.i : doc[idxValue].right_edge.i + 1]
    # Span.merge() was removed in spaCy v3; Doc.retokenize() replaces it
    with doc.retokenize() as retokenizer:
        retokenizer.merge(span)
    for token in doc:
        if token.dep_ in ('dobj', 'pobj') or token.pos_ == "PRON":
            print(token.text)
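
To make the left_edge / right_edge idea behind the code above easier to follow, here is a minimal sketch in plain Python that does not need spaCy or a model. The WORDS, POS and HEADS lists are a hand-annotated parse of the example sentence (an assumption made for illustration, not actual spaCy output), and subtree_indices and nested_noun_phrases are hypothetical helper names. For every noun or pronoun it builds the base phrase (left edge up to the noun) and the full phrase (left edge up to the right edge of its subtree).

```python
# Hand-annotated parse of the example sentence. This is an assumption made
# for illustration; a real parser may attach some tokens differently.
WORDS = ["We", "try", "to", "explicitly", "describe", "the", "geometry",
         "of", "the", "edges", "of", "the", "images", "."]
POS = ["PRON", "VERB", "PART", "ADV", "VERB", "DET", "NOUN",
       "ADP", "DET", "NOUN", "ADP", "DET", "NOUN", "PUNCT"]
# HEADS[i] is the index of token i's syntactic head; the root points to itself.
HEADS = [1, 1, 4, 4, 1, 6, 4, 6, 9, 7, 9, 12, 10, 1]


def subtree_indices(i):
    """Return the indices of token i and all of its descendants."""
    indices = [i]
    for child, head in enumerate(HEADS):
        if head == i and child != i:
            indices.extend(subtree_indices(child))
    return indices


def nested_noun_phrases():
    """Collect base and full phrases for every noun or pronoun, in order."""
    phrases = []
    for i, pos in enumerate(POS):
        if pos not in ("NOUN", "PROPN", "PRON"):
            continue
        subtree = subtree_indices(i)
        left, right = min(subtree), max(subtree)
        base = " ".join(WORDS[left:i + 1])      # left edge up to the head noun
        full = " ".join(WORDS[left:right + 1])  # left edge up to the right edge
        for phrase in (base, full):
            if phrase not in phrases:           # skip duplicates such as "We"
                phrases.append(phrase)
    return phrases


if __name__ == "__main__":
    for phrase in nested_noun_phrases():
        print(phrase)
```

With a real spaCy Doc, the same two spans would presumably be doc[token.left_edge.i : token.i + 1] and doc[token.left_edge.i : token.right_edge.i + 1], which avoids merging tokens and re-parsing the text for every noun. The result then depends on how the parser attaches the prepositions.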