Substring replacements based on replace and no-replace rules

Question

I have a string and a set of rules (mappings) that say what should be replaced and what should not be replaced.

E.g.

"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."

Replacement rules:

replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}

Result:

"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."

Additional criteria:

Only replace if case is matched, i.e. case matters.
Whole-word replacement only; punctuation should be ignored when matching, but kept after replacement.

What would be the cleanest way to solve this problem in Python 3.x?

MichaelJanz · Accepted Answer

Based on the answer of demongolem.

UPDATE

I am sorry, I missed the fact that only whole words should be replaced. I have updated my code and generalized it into a function.

import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    # A "whole word" is the token delimited by one of these characters or by
    # the start/end of the string; adjust the class to your own definition.
    delim = r"""["'.,:; ]"""
    rx = rf"(?:^|(?<={delim})){re.escape(replace_token)}(?={delim}|$)"
    context_size = len(dont_replace)
    pieces = []
    last = 0
    for m in re.finditer(rx, sentence):
        # Window just wide enough that any dont_replace occurrence inside it
        # must overlap the matched token.
        lo = max(0, m.start() - context_size + 1)
        hi = m.end() + context_size - 1
        if dont_replace and dont_replace in sentence[lo:hi]:
            continue
        # Copy the untouched text before the match, then the replacement;
        # punctuation around the token is outside the match and is kept.
        pieces.append(sentence[last:m.start()])
        pieces.append(replace_with)
        last = m.end()
    pieces.append(sentence[last:])
    return "".join(pieces)

Use a regular expression with finditer() to find every occurrence of the token, since we need to know whether it stands alone as a whole word or is embedded in a longer word. You may need to adjust rx to your own definition of "whole word". For each match, take a window of text around it that is just large enough to hold the no-replace string, and check whether that window contains it. If it does, the match is left as it is. Otherwise the match is replaced, and the result is assembled from the untouched slices of the original sentence plus the replacements. Because each match is replaced at its own position, the surrounding punctuation is preserved and other occurrences of the same text are not affected.
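To make the "whole word" definition easier to inspect and tune, here is a minimal sketch that only lists the spans such a delimiter-based pattern would match. The helper name whole_word_spans and the DELIMS constant are hypothetical and not part of the answer above; the delimiter set is an assumption you would adapt to your data.

```python
import re

# Assumed delimiter set: quotes, common punctuation and the space character.
DELIMS = r"""["'.,:; ]"""

def whole_word_spans(text, token):
    """Return (start, end) spans where token stands alone as a whole word.

    Hypothetical helper: a token counts as a whole word when it is preceded
    by a delimiter or the start of the text, and followed by a delimiter or
    the end of the text. Matching is case-sensitive.
    """
    pattern = re.compile(
        rf"(?:^|(?<={DELIMS})){re.escape(token)}(?={DELIMS}|$)"
    )
    return [m.span() for m in pattern.finditer(text)]

text = "A sentence, a 'sentence' and a sentencepiece."
for start, end in whole_word_spans(text, "sentence"):
    # Prints the two standalone occurrences; "sentencepiece" is skipped.
    print(start, end, repr(text[start:end]))
```

Because the delimiters sit in lookarounds, they are not part of the match, so whatever punctuation surrounds the token stays in place when the span is later replaced.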

Using your examples (with sen1 and sen2 holding the two sentences from the question), this leads to:

replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."

and

replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'
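The question uses a replace_dictionary and a no_replace_set rather than a single token and phrase. One possible way to handle several rules in one pass is sketched below. The function name apply_rules is hypothetical, the whole-word test uses \w lookarounds instead of an explicit delimiter list, and no-replace phrases are matched as plain substrings; all three are assumptions rather than part of the answer above.

```python
import re

def apply_rules(text, replace_dictionary, no_replace_set):
    """Replace whole-word dictionary keys unless they overlap a protected phrase.

    Hypothetical sketch: protected spans are every occurrence of a
    no-replace phrase, found by plain case-sensitive substring search.
    """
    if not replace_dictionary:
        return text

    # Spans of the text that must stay untouched.
    protected = [
        m.span()
        for phrase in no_replace_set
        for m in re.finditer(re.escape(phrase), text)
    ]

    # Longest keys first, so a longer key wins over a key that is its prefix.
    keys = sorted(replace_dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )

    def substitute(match):
        start, end = match.span()
        # Keep the original text if the match overlaps any protected span.
        if any(start < p_end and p_start < end for p_start, p_end in protected):
            return match.group(0)
        return replace_dictionary[match.group(0)]

    # re.sub scans the original text once, so replacements are not re-scanned.
    return pattern.sub(substitute, text)

replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
sen1 = "This is an example sentence that needs to be processed into a new sentence."
sen2 = ("This is a second example sentence that shows how 'sentence' "
        "in 'sentencepiece' should not be replaced.")

print(apply_rules(sen1, replace_dictionary, no_replace_set))
print(apply_rules(sen2, replace_dictionary, no_replace_set))
```

Checking for overlap with protected spans, instead of searching a window around each match, means a no-replace phrase that merely sits next to a match does not block the replacement.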

Tags: python, string-matching, replace

Jovan Andonov

