I want to remove the periods in acronyms from a string of text, but I also want to leave regular periods (at the end of a sentence, for example) intact.
So the following sentence:
"The C.I.A. is a department in the U.S. Government."
Should become
"The CIA is a department in the US Government."
Is there a clean way to do this using Python? So far I have a two-step process:
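import re

words = "The C.I.A. is a department in the U.S. Government."
# Step 1: drop the final period of each acronym (the first dot is escaped so it matches a literal period).
words = re.sub(r'([A-Z]\.[A-Z.]*)\.', r'\1', words)
print(words)
# The C.I.A is a department in the U.S Government.
# Step 2: drop any period that is immediately followed by a capital letter.
words = re.sub(r'\.([A-Z])', r'\1', words)
print(words)
# The CIA is a department in the US Government.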
Abbreviations/Acronyms
Abbreviations and acronyms are used to save space and to avoid distracting the reader. Acronyms that abbreviate three or more words are usually written without periods (an exception is U.S.S.R.). Abbreviations should only be used if the organization or term appears two or more times in the text.
In American English, we always put a period after an abbreviation; it doesn't matter whether the abbreviation is the first two letters of the word (as in Dr. for Drive) or the first and last letter (as in Dr. for Doctor).
The current style is to use periods with most lowercase and mixed-case abbreviations (examples: a.m., etc., vol., Inc., Jr., Mrs., Tex.) and to omit periods with most uppercase abbreviations (examples: FBI, IRS, ATM, NATO, NBC, TX).
Probably this?
>>> import re
>>> s = "The C.I.A. is a department in the U.S. Government."
>>> re.sub(r'(?<!\w)([A-Z])\.', r'\1', s)
'The CIA is a department in the US Government.'
This removes each dot that follows a single uppercase letter, provided that letter is not immediately preceded by any character in the \w character set. The latter criterion is enforced by the negative lookbehind assertion (?<!\w).
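To make the behavior easier to check, here is a minimal sketch that wraps the same pattern in a function and runs it on a few inputs. The names strip_acronym_periods and ACRONYM_DOT are made up for illustration and are not part of the re module, and the extra test sentences are my own assumptions rather than cases from the question.

```python
import re

# Hypothetical names for illustration; the pattern itself is the one shown above.
# (?<!\w)  the capital letter must not be preceded by a word character
# ([A-Z])  a single uppercase letter, captured so it can be kept
# \.       the literal period that gets dropped
ACRONYM_DOT = re.compile(r'(?<!\w)([A-Z])\.')

def strip_acronym_periods(text):
    """Remove the period after each standalone capital letter."""
    return ACRONYM_DOT.sub(r'\1', text)

print(strip_acronym_periods("The C.I.A. is a department in the U.S. Government."))
# The CIA is a department in the US Government.

print(strip_acronym_periods("Dr. Smith works at NATO."))
# Dr. Smith works at NATO.

print(strip_acronym_periods("So did I."))
# So did I
```

The last call shows a limitation: a sentence that ends in a single capital letter (or in an acronym such as "U.S.") loses its final period, because the pattern cannot tell an acronym dot from a sentence-ending one. If that matters for your text, you would need some extra rule on top of this.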