
Replacing substrings with placeholders and restoring them after a function call

Given a string and a list of substrings that should be replaced with placeholders, e.g.

import re
from copy import copy 

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

The first goal is to replace the substrings from phrases in original_text with indexed placeholders, e.g.

text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen

Then there'll be some functions to manipulate the text with the placeholders, e.g.

cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)

that outputs:

MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2

The last step is to reverse the replacement and put the original phrases back (with their spaces turned into underscores), i.e.

' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])

[out]:

"'s_morgen ik 's-Hertogenbosch depository_financial_institution"

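A possible variation on this last step, offered as a sketch rather than a tested solution: splitting on whitespace only restores a placeholder when it stands alone as a token, so "MWEPHRASE0," would be missed. The example below assumes the placeholders always look like MWEPHRASE followed by digits and restores them in one re.sub call with a function as the replacement. The name restore and the sample cleaned_text are made up for illustration.

```python
import re

# Mapping from the question: placeholder -> phrase with spaces as underscores.
backplacement = {
    "MWEPHRASE0": "'s_morgen",
    "MWEPHRASE1": "'s-Hertogenbosch",
    "MWEPHRASE2": "depository_financial_institution",
}


def restore(text, backplacement):
    """Hypothetical helper: put the phrases back in a single pass."""
    # \d+ is greedy, so MWEPHRASE10 is read as index 10,
    # not as MWEPHRASE1 followed by a stray "0".
    # Placeholders missing from the mapping are left untouched.
    return re.sub(
        r"MWEPHRASE\d+",
        lambda m: backplacement.get(m.group(0), m.group(0)),
        text,
    )


# Assumed sample input: placeholders glued to punctuation.
cleaned_text = "MWEPHRASE0, ik MWEPHRASE1 MWEPHRASE2."
restored = restore(cleaned_text, backplacement)
print(restored)
# 's_morgen, ik 's-Hertogenbosch depository_financial_institution.
```

Because the replacement is a function, its return value is inserted literally, so backslashes or group references inside a phrase are not reinterpreted.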
The questions are:

  1. If the list of substrings in phrases is huge, the first replacement and the final backplacement take a very long time.

Is there a way to do the replacement/backplacement with a regex?

  2. Using the re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) regex substitution isn't very helpful, especially if some of the phrases match only part of a word.

E.g.

phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

we get an awkward output:

Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen

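I've tried using r'\b{}\b'.format(phrase), but that didn't work for the phrases that start with punctuation: \b only matches between a word character and a non-word character, and there is no such transition between a space and an apostrophe, i.e.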

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen

Is there some way to denote the word boundary for the phrases in the re.sub regex pattern?
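One way both questions might be approached, as a sketch under assumptions rather than a benchmarked solution: compile all phrases into a single alternation, escape each one with re.escape, and replace \b with the lookarounds (?<!\w) and (?!\w), which only require that no word character touches the match. The helper names build_pattern and to_placeholders are hypothetical, and "org" is added to the phrase list only to show that it no longer matches inside "morgen".

```python
import re


def build_pattern(phrases):
    """Hypothetical helper: one regex that matches any of the phrases."""
    # Python's alternation takes the first branch that matches, not the
    # longest one, so longer phrases are listed first.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    # (?<!\w) and (?!\w) assert that no word character sits next to the
    # match; unlike \b, this also works when a phrase starts with "'".
    return re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))


def to_placeholders(text, phrases):
    """Hypothetical helper: replace all phrases in a single pass."""
    index = {phrase: i for i, phrase in enumerate(phrases)}
    backplacement = {
        "MWEPHRASE{}".format(i): phrase.replace(" ", "_")
        for phrase, i in index.items()
    }
    pattern = build_pattern(phrases)
    replaced = pattern.sub(
        lambda m: "MWEPHRASE{}".format(index[m.group(0)]), text
    )
    return replaced, backplacement


phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution", "org"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
text, backplacement = to_placeholders(original_text, phrases)
print(text)
# Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
```

The text is scanned once instead of once per phrase, which may help with a large phrase list, although a very large alternation can be slow in its own right, so it is worth timing on real data.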

alvas asked Mar 14 '18 08:03




1 Answer

Instead of using re.sub, you can split the string on each phrase!

def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string


def get_replaced_list(string, words):
    result = [(string, True), ]

    # we take each word we want to replace
    for w in words:

        new_result = []

        # Getting each word in old result
        for r in result:

            # Now we split every string in results using our word.
            split_list = [(x, True) for x in r[0].split(w)] if r[1] else [r]

            # If we replace successfully - add all the strings
            if len(split_list) > 1:

                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])

                # Add to new result
                new_result.extend(sub_result)

            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result


if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])

Here are the variable values from the example above:

initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'

As you can see, the lists contain tuples. Each tuple holds two values: a string and a boolean that says whether the string is ordinary text or a replaced value (True for text). Once you have the replaced list, you can modify it as in the example, checking whether each item is a text value (if x[1] == True). Hope that helps!

P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6 or later.
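For comparison, here is a compact sketch of the same idea applied to the phrases from the question. It swaps the nested loops for re.split with a capturing group, which returns free text and matched phrases alternately, and it produces the same kind of (string, is_text) tuples. This is not the code above: the names split_protected and strip_commas are hypothetical, the phrase list is assumed to be non-empty, and strip_commas merely stands in for whatever function should only touch the ordinary text.

```python
import re


def split_protected(text, phrases):
    """Hypothetical helper: split text into (string, is_text) tuples."""
    # Longest phrases first, so a longer phrase wins over a shorter prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = "({})".format("|".join(re.escape(p) for p in ordered))
    # With a capturing group, re.split keeps the delimiters:
    # even positions are free text, odd positions are matched phrases.
    parts = re.split(pattern, text)
    return [(part, i % 2 == 0) for i, part in enumerate(parts)]


def strip_commas(segment):
    """Stand-in for the real cleaning function (an assumption)."""
    return segment.replace(",", "")


text = "Something, 's morgen, ik 's-Hertogenbosch gehen"
segments = split_protected(text, ["'s morgen", "'s-Hertogenbosch"])

# Clean only the free text; protected phrases just get their underscores.
result = "".join(
    strip_commas(s) if is_text else s.replace(" ", "_")
    for s, is_text in segments
)
print(result)
# Something 's_morgen ik 's-Hertogenbosch gehen
```

Since the protected phrases never pass through the cleaning function, no placeholders or backplacement step are needed with this layout.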

Dmitry Arkhipenko answered Sep 18 '22 07:09