I have a large (~50k) term list and a number of these key phrases / terms have corresponding acronyms / abbreviations. I need a fast way of finding either the abbreviation or the expanded abbreviation ( i.e. MS -> Microsoft ) and then replacing that with the full expanded abbreviation + abbreviation ( i.e. Microsoft -> Microsoft (MS) or MS -> Microsoft (MS) ).
I am very new to spaCy, so my naive approach was going to be to use spacy_lookup and use both the abbreviation and the expanded abbreviation as keywords and then using some kind of pipeline extension to then go through the matches and replace them with the full expanded abbreviation + abbreviation.
Is there a better way of tagging and resolving acronyms/abbreviations in spaCy?
Typically, acronyms and initialisms are written in all capital letters to distinguish them from ordinary words. (When fully spelled out, the words in acronyms and initialisms do not need to be capitalized unless they entail a proper noun.) An acronym is pronounced as a single word, rather than as a series of letters.
The first time you use an acronym, write the phrase in full and place the acronym in parentheses immediately after it. You can then use the acronym throughout the rest of the text.
Check out scispacy on GitHub, which implements the acronym identification heuristic described in this paper, (see also here). The heuristic works if acronyms are "introduced" in the text with a pattern like
StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!
A working way to replace all acronyms in a piece of text with their long form could then be
import spacy
from scispacy.abbreviation import AbbreviationDetector
nlp = spacy.load("en_core_web_sm")
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)
text = "StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!"
def replace_acronyms(text):
doc = nlp(text)
altered_tok = [tok.text for tok in doc]
for abrv in doc._.abbreviations:
altered_tok[abrv.start] = str(abrv._.long_form)
return(" ".join(altered_tok))
replace_acronyms(text)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With