Lowercase text with regex pattern

Question

I use a regex pattern to keep acronyms intact while lowercasing text.

The code is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import codecs
import os
import re

text = "This sentence contains ADS, NASA and K.A. as acronymns."

pattern = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
matches = re.findall(pattern, text)

def lowercase_ignore_matches(match):
    word = match.group()
    if word in matches:
        return word
    return word.lower()

text2 = re.sub(r"\w+", lowercase_ignore_matches, text)

print(text)
print(text2)

print(matches)

The output is:

This sentence contains ADS, NASA and K.A. as acronymns.
this sentence contains ADS, NASA and k.a. as acronymns.
['ADS', 'NASA', 'K.A.']

Why is K.A. lowercased to k.a. even though the pattern identifies it as an acronym?

I want K.A. to stay as K.A.

Kindly help

DTNGNR · Accepted Answer

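Your version lowercases K.A. because \w+ does not match dots, so the callback receives "K" and "A" as separate words and neither of them is in matches. The solution with r[\w\.] works in this case but will struggle if the acronym is at the end of a sentence with a dot after it (e.g. "[...] or ASDF."). Instead, we use the pattern to identify every acronym, then lowercase the whole string, and then replace the lowercased acronyms with their original values.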
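The two tokenizations discussed above can be compared side by side. This is only a sketch: the shortened sample text and the helper name keep_acronyms are made up for illustration, and [\w.]+ is my reading of the r[\w\.] solution referred to above.

```python
import re

# Shortened sample text (illustrative, not the asker's exact input)
text = "NASA and K.A. or ASDF."

# Acronym pattern from the question
pattern = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
acronyms = re.findall(pattern, text)  # ['NASA', 'K.A.', 'ASDF']

def keep_acronyms(match):
    """Return the token unchanged if it is a known acronym, else lowercase it."""
    word = match.group()
    return word if word in acronyms else word.lower()

# \w+ never matches a dot, so "K.A." is split into "K" and "A"
tokens_word = re.findall(r"\w+", text)
with_word = re.sub(r"\w+", keep_acronyms, text)

# [\w.]+ keeps "K.A." whole, but also glues the sentence-final dot onto "ASDF"
tokens_dot = re.findall(r"[\w.]+", text)
with_dot = re.sub(r"[\w.]+", keep_acronyms, text)

print(tokens_word, with_word)
print(tokens_dot, with_dot)
```

In the first case "K" and "A" are not in the acronym list, and in the second case "ASDF." (with the dot) is not, so each variant lowercases one acronym it should have kept.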

I changed the pattern a bit so that it also supports acronyms like "eFUEL".

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import re

text = "This sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF."

# Optional leading letters, at least two capitals (optionally separated
# by dots), then optional trailing letters
pattern = r'\b([A-Za-z]*?[A-Z](?:\.?[A-Z])+[A-Za-z]*)'

# Find all matches of the pattern in the text
matches = re.findall(pattern, text)
print(matches)

# Make everything lowercase
text2 = text.lower()

# Replace each lowercased match with its original version
for match in matches:
    text2 = text2.replace(match.lower(), match)

print(text2)

The result is:

['ADS', 'NASA', 'K.A', 'eFUEL', 'ASDF']
this sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF.
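One caveat with the approach above: str.replace restores every occurrence of the lowercased acronym, including occurrences inside other words (for example "ads" inside "roads"). A possible single-pass variant, sketched here under the assumption that the same acronym pattern is used, lowercases only the text between matches. The function name lowercase_except_acronyms is hypothetical.

```python
import re

# Same acronym pattern as above, with [A-Za-z] and no capture group
ACRONYM = re.compile(r'\b[A-Za-z]*?[A-Z](?:\.?[A-Z])+[A-Za-z]*')

def lowercase_except_acronyms(text):
    """Lowercase everything except the spans matched by ACRONYM."""
    pieces = []
    last = 0
    for m in ACRONYM.finditer(text):
        # Lowercase the gap before this acronym, keep the acronym as-is
        pieces.append(text[last:m.start()].lower())
        pieces.append(m.group())
        last = m.end()
    # Lowercase whatever follows the last acronym
    pieces.append(text[last:].lower())
    return "".join(pieces)

print(lowercase_except_acronyms(
    "This sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF."
))
print(lowercase_except_acronyms("ADS on roads"))
```

Because each acronym is kept at its own position, nothing is searched for a second time in the lowercased text, so "roads" stays untouched.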

Tags: python, regex

Programmer_nltk
