
How to identify abbreviations/acronyms and expand them in spaCy?

I have a large (~50k) term list, and a number of these key phrases / terms have corresponding acronyms / abbreviations. I need a fast way of finding either the abbreviation or its expanded form (e.g. MS -> Microsoft) and then replacing it with the expanded form followed by the abbreviation (e.g. Microsoft -> Microsoft (MS) or MS -> Microsoft (MS)).

I am very new to spaCy, so my naive approach was going to be to use spacy_lookup with both the abbreviation and the expanded form as keywords, and then use some kind of pipeline extension to go through the matches and replace them with the expanded form + abbreviation.

Is there a better way of tagging and resolving acronyms/abbreviations in spaCy?

asked Sep 29 '18 17:09 by steve


1 Answer

Check out scispacy on GitHub, which implements the acronym identification heuristic described in this paper (see also here). The heuristic works if acronyms are "introduced" in the text with a pattern like:

StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!
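To give a rough idea of how such a "long form (SHORT)" heuristic can work, here is a minimal pure-Python sketch. It is not scispacy's actual implementation; the function names `find_long_form` and `find_definitions` are made up for illustration, and the window size for the candidate long form is an assumption. The short form's characters are matched right to left against the words preceding the parenthesis, and the first character has to start a word.

```python
import re

# A parenthesized chunk with no nested parentheses, e.g. "(SO)".
PAREN = re.compile(r"\(([^()]+)\)")


def find_long_form(short_form, candidate):
    """Return the tail of `candidate` that matches `short_form`, or None."""
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1
            continue
        # Walk left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while l_idx >= 0 and (
            candidate[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    # Extend the match to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def find_definitions(text):
    """Map each short form introduced as 'long form (SHORT)' to its long form."""
    found = {}
    for match in PAREN.finditer(text):
        short_form = match.group(1).strip()
        words = text[: match.start()].split()
        # Assumed window: only a few words before the parenthesis are considered.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_long_form(short_form, candidate)
        if long_form:
            found[short_form] = long_form
    return found


print(find_definitions("StackOverflow (SO) is a question and answer site. SO rocks!"))
# {'SO': 'StackOverflow'}
```

Note that this only finds abbreviations that are defined somewhere in the text itself; an abbreviation that is never introduced this way will not be detected.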

A working way to replace all acronyms in a piece of text with their long form could then be:

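import spacy
# Importing AbbreviationDetector registers the "abbreviation_detector" factory.
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_web_sm")

# spaCy v3 adds components by their registered name.
# On spaCy v2 this was: nlp.add_pipe(AbbreviationDetector(nlp))
nlp.add_pipe("abbreviation_detector")

text = "StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!"

def replace_acronyms(text):
    doc = nlp(text)
    altered_tok = [tok.text for tok in doc]
    # Each detected abbreviation is a span; overwrite its first token
    # with the long form found by the detector.
    for abrv in doc._.abbreviations:
        altered_tok[abrv.start] = str(abrv._.long_form)

    # Joining on spaces does not restore the original spacing around punctuation.
    return " ".join(altered_tok)

print(replace_acronyms(text))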
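If the abbreviations come from a fixed term list, as in the question, rather than being introduced in the text, a dictionary-based replacement may be enough. The following is a minimal sketch in plain Python (no spaCy); the `TERMS` mapping and the `build_normalizer` helper are hypothetical names used only for illustration. It rewrites either form to "Expansion (ABBR)" and leaves text that is already in that form untouched.

```python
import re

# Hypothetical term list: abbreviation -> expanded form.
TERMS = {"MS": "Microsoft", "SO": "StackOverflow"}


def build_normalizer(terms):
    """Return a function that rewrites every known form to 'Expansion (ABBR)'."""
    canonical = {}
    for abbr, full in terms.items():
        target = f"{full} ({abbr})"
        # The target maps to itself so an existing "Microsoft (MS)" is not doubled.
        for form in (abbr, full, target):
            canonical[form] = target
    # Longest alternatives first, so "Microsoft (MS)" wins over "Microsoft".
    alternatives = sorted(canonical, key=len, reverse=True)
    # Lookarounds instead of \b, because the target form ends in ")".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(map(re.escape, alternatives)) + r")(?!\w)"
    )
    return lambda text: pattern.sub(lambda m: canonical[m.group(0)], text)


normalize = build_normalizer(TERMS)
print(normalize("MS bought GitHub."))
# Microsoft (MS) bought GitHub.
```

Matching here is case-sensitive and purely string-based, so an ambiguous abbreviation (MS as "multiple sclerosis", say) would be expanded blindly. For a list as large as ~50k terms, one big regex alternation may also get slow; spaCy's PhraseMatcher is built for matching large terminology lists and could take over the matching step, with the same canonical mapping applied to its matches.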
answered Oct 06 '22 11:10 by Davide Fiocco