I use a regex pattern to keep acronyms intact while lowercasing text.
The code is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
import os
import re
text = "This sentence contains ADS, NASA and K.A. as acronymns."
pattern = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
matches = re.findall(pattern, text)
def lowercase_ignore_matches(match):
    word = match.group()
    if word in matches:
        return word
    return word.lower()
text2 = re.sub(r"\w+", lowercase_ignore_matches, text)
print(text)
print(text2)
matches = re.findall(pattern, text)
print(matches)
The output is:
This sentence contains ADS, NASA and K.A. as acronymns.
this sentence contains ADS, NASA and k.a. as acronymns.
['ADS', 'NASA', 'K.A.']
Why is K.A. lowercased to k.a. even though the pattern identifies it as an acronym?
I want to keep it as K.A.
Please help.
The solution with [\w\.] works in this case, but it will struggle if the acronym is at the end of a sentence with a dot after it (e.g. "[...] or ASDF."). Instead, we use the pattern to identify every acronym, then lowercase the whole string, and then replace the lowercased acronyms with their original values.
I changed the pattern a bit so that it also supports acronyms like "eFUEL":
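The tokenization problem behind both cases can be checked directly. This is a minimal sketch; the sample sentence and the variable names are made up for illustration, and `[\w.]+` is assumed to be the character class the other solution uses.

```python
import re

# Hypothetical sample sentence with a dotted acronym and a sentence-final acronym.
sample = "This sentence contains K.A. and ASDF."

# \w does not match ".", so "K.A." is visited as two separate tokens, "K" and "A".
# Neither token equals the stored match "K.A.", so both get lowercased.
word_tokens = re.findall(r"\w+", sample)
print(word_tokens)
# ['This', 'sentence', 'contains', 'K', 'A', 'and', 'ASDF']

# Adding "." to the class keeps "K.A." together, but it also glues the
# sentence-final dot onto "ASDF".
dotted_tokens = re.findall(r"[\w.]+", sample)
print(dotted_tokens)
# ['This', 'sentence', 'contains', 'K.A.', 'and', 'ASDF.']
```

The token "ASDF." is not equal to the stored match "ASDF", so the callback would lowercase it.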
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
import os
import re
text = "This sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF."
# [A-Za-z] rather than [A-z]: the range A-z also matches [ \ ] ^ _ and `
pattern = r'\b([A-Za-z]*?[A-Z](?:\.?[A-Z])+[A-Za-z]*)'
# Find all matches of the pattern in the text
matches = re.findall(pattern, text)
print(matches)
# Make everything lowercase
text2 = text.lower()
# Replace each lowercased match with its original version
for match in matches:
    text2 = text2.replace(match.lower(), match)
print(text2)
The result is:
['ADS', 'NASA', 'K.A', 'eFUEL', 'ASDF']
this sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF.
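One caveat with the replace step: str.replace works on substrings, so a lowercased acronym such as "ads" would also be restored inside an unrelated word like "roads". A possible single-pass variant is sketched below; it is not part of the answer above, and the names ACRONYM, TOKEN, and lowercase_except_acronyms are hypothetical. It reuses the same acronym pattern and lets re.sub decide word by word.

```python
import re

# Same acronym pattern as in the answer; the second alternative (\w+)
# catches every other word.
ACRONYM = r"\b[A-Za-z]*?[A-Z](?:\.?[A-Z])+[A-Za-z]*"
TOKEN = re.compile(r"(" + ACRONYM + r")|\w+")

def lowercase_except_acronyms(text):
    """Lowercase every word except those matched by the acronym pattern."""
    def keep_or_lower(match):
        # Group 1 is set only when the acronym alternative matched.
        if match.group(1) is not None:
            return match.group(1)
        return match.group(0).lower()
    return TOKEN.sub(keep_or_lower, text)

print(lowercase_except_acronyms(
    "This sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF."))
# this sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF.
print(lowercase_except_acronyms("ADS on Roads"))
# ADS on roads
```

Because each word is handled exactly once, an acronym is only preserved where it actually occurs in the original text.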