Replacing strings with placeholders and restoring them after a function

Given a string and a list of substrings that should be replaced with placeholders, e.g.

import re
from copy import copy 

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

The first goal is to replace the substrings from phrases in the original_text with indexed placeholders, e.g.

text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen

Then there'll be some functions to manipulate the text with the placeholders, e.g.

cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)

that outputs:

MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2

The last step is to reverse the replacement and put back the original phrases, i.e.

' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])

[out]:

"'s_morgen ik 's-Hertogenbosch depository_financial_institution"

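The backplacement step can also be done in one substitution pass instead of splitting on whitespace. A minimal sketch, assuming the placeholders always have the form MWEPHRASE followed by digits (the names placeholder_re and restore are made up for this illustration):

```python
import re

# Mapping built during the forward replacement (values copied from the example above).
backplacement = {
    "MWEPHRASE0": "'s_morgen",
    "MWEPHRASE1": "'s-Hertogenbosch",
    "MWEPHRASE2": "depository_financial_institution",
}
cleaned_text = "MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2"

# One pattern matches every placeholder, however many phrases there are.
placeholder_re = re.compile(r"MWEPHRASE\d+")


def restore(match):
    # Look the placeholder up; leave unknown placeholders untouched.
    return backplacement.get(match.group(0), match.group(0))


restored = placeholder_re.sub(restore, cleaned_text)
print(restored)
# 's_morgen ik 's-Hertogenbosch depository_financial_institution
```

Unlike the split-based version, this also restores a placeholder that ends up directly next to punctuation, e.g. "MWEPHRASE0,".
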
The questions are:

If the list of substrings in phrases is huge, the first replacement and the final backplacement would take a very long time.

Is there a way to do the replacement/backplacement with a regex?

Using the re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) regex substitution isn't very helpful, especially if some of the phrases match only part of a word.

E.g.

phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

We get an awkward output:

Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen

I've tried using '\b{}\b'.format(phrase), but that didn't work for the phrases that contain punctuation, i.e.

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen
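
For context on why only MWEPHRASE2 is replaced: \b matches only between a word character (\w) and a non-word character. In "ik 's-Hertogenbosch" the phrase starts with an apostrophe preceded by a space, so both sides are non-word characters and there is no boundary to match. A small sketch of the difference, using lookarounds as one possible substitute for \b (not something from the original post):

```python
import re

text = "ik 's-Hertogenbosch im das depository financial institution gehen"

# \b needs a word character on exactly one side of the position.
# Between " " and "'" both characters are non-word, so this cannot match.
with_b = re.search(r"\b's-Hertogenbosch\b", text)

# The lookarounds only require that no word character touches the phrase.
with_lookaround = re.search(r"(?<!\w)'s-Hertogenbosch(?!\w)", text)

print(with_b)                    # None
print(with_lookaround.group(0))  # 's-Hertogenbosch
```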

Is there some way to denote the word boundary for the phrases in the re.sub regex pattern?

asked Mar 14 '18 08:03

alvas

1 Answer

Instead of using re.sub you can split it!

def do_something_with_str(string):
    # do something with string here.
    # for example, let's prefix the string with the "@" symbol if it's not empty
    return f"@{string}" if string else string


def get_replaced_list(string, words):
    result = [(string, True), ]

    # we take each word we want to replace
    for w in words:

        new_result = []

        # Getting each word in old result
        for r in result:

            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])

            # If we replace successfully - add all the strings
            if len(split_list) > 1:

                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])

                # Add to new result
                new_result.extend(sub_result)

            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result


if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])

Here are the variable values from the example above:

initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'

As you can see, the lists contain tuples. Each tuple holds two values: a string and a boolean indicating whether it is plain text or a replaced value (True for text). Once you have the replaced list, you can modify it as in the example, checking whether an item is a text value (if x[1] == True). Hope that helps!
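
If a regex-based route is still preferred over splitting, one possible sketch (not part of the answer above) is to compile all phrases into a single escaped alternation and replace them in one pass. The helper names build_pattern and replace_phrases are hypothetical, and the sketch assumes a non-empty phrase list and case-sensitive matching:

```python
import re


def build_pattern(phrases):
    # Longest phrases first, so a longer phrase wins over a shorter one
    # that starts at the same position.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    # Lookarounds instead of \b: they also work when a phrase starts or
    # ends with punctuation such as an apostrophe.
    return re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))


def replace_phrases(text, phrases):
    placeholder_for = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}
    backplacement = {ph: p.replace(" ", "_") for p, ph in placeholder_for.items()}
    pattern = build_pattern(phrases)
    # One pass over the text; the callback picks the placeholder per match.
    replaced = pattern.sub(lambda m: placeholder_for[m.group(0)], text)
    return replaced, backplacement


phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
text, backplacement = replace_phrases(original_text, phrases)
print(text)
# Something, 's morgen, ik MWEPHRASE1 im das MWEPHRASE2 gehen
```

Here "org" inside "morgen" is left alone because a word character sits directly before it, and the text is scanned once regardless of how many phrases there are.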

P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6 or later.

answered Sep 18 '22 07:09

Dmitry Arkhipenko

Tags: python, string, regex, replace, placeholder
