I want to do a noisy resolution such that given a personal prounoun, that pronoun is replace by the previous(nearest) person.
For example:
Alex is looking at buying a U.K. startup for $1 billion. He is very confident that this is going to happen. Sussan is also in the same situation. However, she has lost hope.
the output is:
Alex is looking at buying a U.K. startup for $1 billion. Alex is very confident that this is going to happen. Sussan is also in the same situation. However, Susan has lost hope.
Another example,
Peter is a friend of Gates. But Gates does not like him.
In this case, the output would be :
Peter is a friend of Gates. But Gates does not like Gates.
Yes! This is super noisy.
Using spacy:
I have extracted the Person
using NER, but how can I replace pronouns appropriately?
Code:
import spacy
nlp = spacy.load("en_core_web_sm")
for ent in doc.ents:
if ent.label_ == 'PERSON':
print(ent.text, ent.label_)
There is specially dedicated neuralcoref library to resolve coreference. See the minimal reproducible example below:
import spacy
import neuralcoref
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)
doc = nlp(
'''Alex is looking at buying a U.K. startup for $1 billion.
He is very confident that this is going to happen.
Sussan is also in the same situation.
However, she has lost hope.
Peter is a friend of Gates. But Gates does not like him.
''')
print(doc._.coref_resolved)
Alex is looking at buying a U.K. startup for $1 billion.
Alex is very confident that this is going to happen.
Sussan is also in the same situation.
However, Sussan has lost hope.
Peter is a friend of Gates. But Gates does not like Peter.
Note, you may have some issues with neuralcoref
if you pip install it, so it's better to build it from source, as I outlined it here
I have written a function that works for your two examples:
Consider using a larger model such as en_core_web_lg
for more accurate tagging.
import spacy
from string import punctuation
nlp = spacy.load("en_core_web_lg")
def pronoun_coref(text):
doc = nlp(text)
pronouns = [(tok, tok.i) for tok in doc if (tok.tag_ == "PRP")]
names = [(ent.text, ent[0].i) for ent in doc.ents if ent.label_ == 'PERSON']
doc = [tok.text_with_ws for tok in doc]
for p in pronouns:
replace = max(filter(lambda x: x[1] < p[1], names),
key=lambda x: x[1], default=False)
if replace:
replace = replace[0]
if doc[p[1] - 1] in punctuation:
replace = ' ' + replace
if doc[p[1] + 1] not in punctuation:
replace = replace + ' '
doc[p[1]] = replace
doc = ''.join(doc)
return doc
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With