
Apply .lower() function to a string for all cases of words in a list

I would like to apply the function .lower() to a string for all of the words that are predefined in a list, but not for any other words. For instance, take the string provided below.

string1 = "ThE QuIcK BroWn foX jUmpEd oVer thE LaZY dOg."

Now say I have a list as seen below:

lower_list = ['quick', 'jumped', 'dog']

I do not want to apply .lower() to the entire string like this:

string1.lower()

Instead, I want the function to apply .lower() only to the words in string1 that appear in lower_list, giving the output below:

> ThE quick BroWn foX jumped oVer thE LaZY dog.

Can this be done in a simple manner? My idea was to use a for loop, but I need to retain the formatting of the string. For example, a string might span multiple lines, with indents on some lines and not others.

EDIT: I am getting the following error

parts[1::2] = (word.lower() for word in parts[1::2]) 
AttributeError: 'NoneType' object has no attribute 'lower'

I believe this might be due to having characters other than letters in the strings I use in lower_list. If I have a string like '(copy)' in the list, then I get the above error. Is there a way to get around this? I was thinking of converting every split part into a string using str(xxx), but I'm not sure how to do that.

Chandler Cree asked Dec 15 '25 03:12



1 Answer

For this kind of problem you should be careful about cases like this one:

>>> phrase = 'the apothecary'
>>> phrase.replace('the', 'THE')
'THE apoTHEcary'

That is, you only want to replace whole-word matches. Matching whole words by direct string manipulation is quite difficult, because a word boundary can be a space ' ' character, but it can also be a full stop '.' or the start or end of the input string.
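To make the difficulty concrete, here is a minimal sketch of a naive approach that splits on whitespace. The helper name naive_lowercase is hypothetical and exists only for illustration. It misses 'dOg.' because the token includes the full stop, and it does not keep the original whitespace.

```python
def naive_lowercase(string, words):
    # Hypothetical helper for illustration: split on whitespace and
    # lowercase only the tokens that exactly match a listed word.
    tokens = string.split()
    return ' '.join(t.lower() if t.lower() in words else t for t in tokens)

string1 = "ThE QuIcK BroWn foX jUmpEd oVer thE LaZY dOg."
lower_list = ['quick', 'jumped', 'dog']

# 'dOg.' is left unchanged because the token includes the full stop.
print(naive_lowercase(string1, lower_list))
# prints: ThE quick BroWn foX jumped oVer thE LaZY dOg.

# Leading spaces, repeated spaces and the newline are all lost.
print(repr(naive_lowercase('  HELLO   \n WORLD  ', ['hello', 'world'])))
# prints: 'hello world'
```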

Fortunately, regexes make it easy to match whole words, because \b in a regex matches any word boundary. So we can solve the problem this way:

  • Create a regex which matches the words in lower_list, case-insensitive, but only when they have a word boundary before and after them.
  • Split the input string into parts using the regex, capturing the matches.
  • Transform each of the captured matches to lowercase.
  • Join the parts back again.

Because we're splitting on the matched words rather than on spaces, the original whitespace is preserved exactly. Here's an implementation:

import re

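def lowercase_words(string, words):
    # Match any listed word, but only as a whole word (\b on both sides).
    regex = r'\b(' + '|'.join(words) + r')\b'
    # The capturing group makes re.split keep the matches, at the odd indices.
    parts = re.split(regex, string, flags=re.IGNORECASE)
    # Lowercase only the captured matches; the text in between is untouched.
    parts[1::2] = (word.lower() for word in parts[1::2])
    return ''.join(parts)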

Example:

>>> lowercase_words(string1, lower_list)
'ThE quick BroWn foX jumped oVer thE LaZY dog.'
>>> lowercase_words('ThE aPoThEcArY', ['the'])
'the aPoThEcArY'
>>> lowercase_words('  HELLO   \n WORLD  ', ['hello', 'world'])
'  hello   \n world  '
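To see why parts[1::2] holds exactly the matched words, it can help to look at what re.split returns when the pattern contains one capturing group. The sketch below also shows what is most likely behind the AttributeError in the question's edit: an unescaped entry such as '(copy)' is read as a second, nested capturing group, and re.split returns None for a group that did not take part in a match. The sample strings are made up for illustration.

```python
import re

# One capturing group: the matched words land at the odd indices.
parts = re.split(r'\b(quick|dog)\b', 'ThE QuIcK dOg.', flags=re.IGNORECASE)
print(parts)  # ['ThE ', 'QuIcK', ' ', 'dOg', '.']

# An unescaped '(copy)' adds a nested group, so every match now contributes
# two entries, and the inner one is None when 'it' was the word that matched.
words = ['(copy)', 'it']
regex = r'\b(' + '|'.join(words) + r')\b'
broken = re.split(regex, 'Do IT, do IT', flags=re.IGNORECASE)
print(broken)  # ['Do ', 'IT', None, ', do ', 'IT', None, '']

# The odd indices no longer line up with the matches, and one of them is None.
try:
    [p.lower() for p in broken[1::2]]
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'lower'
```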

The above assumes that the words in lower_list only contain letters. If they might contain other characters, then there are two more problems:

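  • We need to escape the special characters with re.escape, so that characters such as '(' and ')' are matched literally instead of being treated as regex syntax.
  • We only want to require a word boundary with \b on a side where the word starts or ends with a letter, because \b never matches between two non-word characters such as a space followed by '('.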

The following makes it work:

import re

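def lowercase_words(string, words):
    def make_regex_part(word):
        # Escape regex metacharacters so e.g. '(try)' is matched literally.
        word = re.escape(word)
        # Require a word boundary only on a side that starts/ends with a letter.
        if word[:1].isalpha():
            word = r'\b' + word
        if word[-1:].isalpha():
            word += r'\b'
        return word

    # A single capturing group keeps the matches at the odd indices of parts.
    regex = '(' + '|'.join(map(make_regex_part, words)) + ')'
    parts = re.split(regex, string, flags=re.IGNORECASE)
    parts[1::2] = (word.lower() for word in parts[1::2])
    return ''.join(parts)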

Example:

>>> lowercase_words('(TrY) iT nOw WiTh bRaCkEtS', ['(try)', 'it'])
'(try) it nOw WiTh bRaCkEtS'
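
As an alternative sketch (a different technique from the split-and-join approach above), re.sub accepts a function as the replacement, so each match can be lowercased in place without relying on index positions. The name lowercase_words_sub is hypothetical; the boundary rule is the same one used in make_regex_part.

```python
import re

def lowercase_words_sub(string, words):
    # Hypothetical variant of lowercase_words built on re.sub.
    def make_regex_part(word):
        escaped = re.escape(word)
        # Require \b only on a side where the word starts/ends with a letter.
        if word[:1].isalpha():
            escaped = r'\b' + escaped
        if word[-1:].isalpha():
            escaped += r'\b'
        return escaped

    if not words:
        return string  # nothing to lowercase
    regex = '|'.join(map(make_regex_part, words))
    # The callback receives each match object and returns its lowercase text.
    return re.sub(regex, lambda m: m.group(0).lower(), string,
                  flags=re.IGNORECASE)

print(lowercase_words_sub('(TrY) iT nOw WiTh bRaCkEtS', ['(try)', 'it']))
# prints: (try) it nOw WiTh bRaCkEtS
```

Since no capturing group is needed here, stray parentheses in the word list cannot shift any indices, though the words still have to be escaped to be matched literally.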
kaya3 answered Dec 16 '25 20:12



