spaCy NLP custom rule matcher

I am a beginner with NLP. I am using the spaCy Python library for my NLP project. Here is my requirement:

I have a JSON file with all country names. I need to parse the document and get the gold medal count for each country in it. A sample sentence is given below:

"Czech Republic won 5 gold medals at olympics. Slovakia won 0 medals olympics"

I am able to fetch the country names, but not their medal counts. My code is given below. Please help me proceed further.

import json
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r"C:\Python36\srclcl\countries.json") as f:
    COUNTRIES = json.load(f)

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp("Czech Republic won 5 gold medals at olympics. Slovakia won 0 medals olympics")
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))

matcher.add("COUNTRY", None, *patterns)


for sent in doc.sents:
    subdoc = nlp(sent.text)
    matches = matcher(subdoc)
    print (sent.text)
    for match_id, start, end in matches:
        print(subdoc[start:end].text)

Also, what if the given text is like this:

"Czech Republic won 5 gold medals at olympics in 1995. Slovakia won 0 medals olympics"

user3383301 asked Oct 16 '22 12:10

1 Answer

spaCy provides rule-based matching, which you could use as follows:

import spacy
from spacy.matcher import Matcher
from spacy.pipeline import EntityRuler

# The statistical NER and the parser are not needed for rule-based matching
nlp = spacy.load('en_core_web_sm', disable=["ner", "parser"])

# Tag every known country name as an entity with the label "country"
countries = ['Czech Republic', 'Slovakia']
ruler = EntityRuler(nlp)
for a in countries:
    ruler.add_patterns([{"label": "country", "pattern": a}])
nlp.add_pipe(ruler)


doc = nlp("Czech Republic won 5 gold medals at olympics. Slovakia won 0 medals olympics")

# Merge multi-word countries into single tokens so one pattern slot covers them
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])


# Match: a country token, the word "won", then a number
matcher = Matcher(nlp.vocab)
pattern = [{'ENT_TYPE': 'country'}, {'LOWER': 'won'}, {'IS_DIGIT': True}]
matcher.add('medal', None, pattern)
matches = matcher(doc)


for match_id, start, end in matches:
    span = doc[start:end]
    print(span)

Output:

Czech Republic won 5
Slovakia won 0
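
The question loads the country names from a JSON file rather than hard-coding them. A minimal sketch of turning such a list into entity patterns is shown below. It assumes the file is a flat JSON array of strings (which the question's use of `nlp.pipe(COUNTRIES)` suggests); `build_country_patterns` is a hypothetical helper name, and the inline `raw` string only stands in for the file contents.

```python
import json


def build_country_patterns(names, label="country"):
    """Turn a list of country names into EntityRuler-style phrase patterns."""
    seen = set()
    patterns = []
    for name in names:
        name = name.strip()
        # Skip empty entries and duplicates while keeping the original order
        if name and name not in seen:
            seen.add(name)
            patterns.append({"label": label, "pattern": name})
    return patterns


# Stand-in for the contents of countries.json (assumed: a JSON array of strings)
raw = '["Czech Republic", "Slovakia", "Slovakia"]'
patterns = build_country_patterns(json.loads(raw))
print(patterns)
```

The resulting list can be passed to `ruler.add_patterns(patterns)` in a single call.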

The above code should get you started. Naturally, you will have to write more complex rules of your own to handle cases such as "Czech Republic unsurprisingly won 5 gold medals at olympics in 1995." as well as other, more complex sentence structures.
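
As a rough sketch of one such rule, the pattern below allows a single optional word before "won" and requires "medal(s)" after the number, so the year 1995 is not picked up as a count. It assumes spaCy v3, where `nlp.add_pipe` takes a component name and `Matcher.add` takes a list of patterns (the code above uses the older v2-style calls). Treating "gold" as optional is also an assumption, made so that "won 0 medals" still yields a count; drop the `"OP": "?"` on that token if only explicit gold medals should count.

```python
import spacy
from spacy.matcher import Matcher

# Assumes spaCy v3. A blank English pipeline is enough here because only
# the tokenizer and the entity ruler are needed.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [{"label": "country", "pattern": name} for name in ["Czech Republic", "Slovakia"]]
)

text = (
    "Czech Republic unsurprisingly won 5 gold medals at olympics in 1995. "
    "Slovakia won 0 medals olympics"
)
doc = nlp(text)

# Merge multi-word countries into single tokens, as in the code above
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)

matcher = Matcher(nlp.vocab)
pattern = [
    {"ENT_TYPE": "country"},
    {"IS_ALPHA": True, "OP": "?"},           # optional word, e.g. "unsurprisingly"
    {"LOWER": "won"},
    {"IS_DIGIT": True},                       # the medal count
    {"LOWER": "gold", "OP": "?"},             # assumption: "gold" may be omitted
    {"LOWER": {"IN": ["medal", "medals"]}},   # rules out a bare year such as 1995
]
matcher.add("medal", [pattern])

# Collect the count per country; the first token of each match is the country
medals = {}
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    count_token = next(tok for tok in span if tok.is_digit)
    medals[span[0].text] = int(count_token.text)

print(medals)
```

Storing the results in a dictionary keyed by country also makes repeated matches for the same country harmless, since a later match simply overwrites the earlier one.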

DBaker answered Oct 20 '22 22:10