Given a string and a list of substrings that should be replaced with placeholders, e.g.
import re
from copy import copy
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
The first goal is to replace the substrings from phrases in original_text with indexed placeholders, e.g.
text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[out]:
Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
Then there'll be some functions to manipulate the text with the placeholders, e.g.
cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)
that outputs:
MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2
The last step is to undo the replacement and put back the original phrases, i.e.
' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])
[out]:
"'s_morgen ik 's-Hertogenbosch depository_financial_institution"
The questions are:
If phrases is huge, the first replacement and the last backplacement would take very long. Is there a way to do the replacement/backplacement with a regex?
The re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) regex substitution isn't very helpful, especially if a phrase matches only part of a word, e.g.
phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
we get an awkward output:
Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen
I've tried using '\b{}\b'.format(phrase), but that didn't work for the phrases with punctuation, i.e.
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[out]:
Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen
Is there some way to denote the word boundary for the phrases in the re.sub regex pattern?
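One possible way to handle both points in a single pass is sketched below. It is only a sketch under some assumptions: the phrases are literal strings (so they go through re.escape), a "word boundary" means "no word character directly before or after the phrase", and placeholders always look like MWEPHRASE followed by digits. The names placeholder_of, phrase_re and placeholder_re are made up for this illustration. The lookarounds (?<!\w) and (?!\w) are used instead of \b because \b needs a word character on one side, which fails when a phrase starts with an apostrophe.

```python
import re

phrases = ["org", "'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = ("Something, 's morgen, ik 's-Hertogenbosch im das "
                 "depository financial institution gehen")

# Hypothetical lookup tables: phrase -> placeholder and placeholder -> phrase.
placeholder_of = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}
backplacement = {ph: p.replace(" ", "_") for p, ph in placeholder_of.items()}

# Escape every phrase so punctuation is matched literally, and try longer
# phrases first so a longer phrase wins over a shorter one at the same position.
alternation = "|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))

# (?<!\w): no word character right before; (?!\w): no word character right after.
phrase_re = re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))

# One pass over the text; the callback picks the placeholder for each match.
text = phrase_re.sub(lambda m: placeholder_of[m.group(0)], original_text)
print(text)

# Backplacement in one pass as well; unknown placeholders are left untouched.
placeholder_re = re.compile(r"MWEPHRASE\d+")
restored = placeholder_re.sub(
    lambda m: backplacement.get(m.group(0), m.group(0)), text)
print(restored)
```

Here "org" no longer matches inside "morgen", and the text is scanned once instead of once per phrase. Whether this is fast enough for a very large phrase list is not something this sketch measures.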
Instead of using re.sub you can split it!
def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string


def get_replaced_list(string, words):
    result = [(string, True), ]
    # we take each word we want to replace
    for w in words:
        new_result = []
        # Getting each word in old result
        for r in result:
            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])
            # If we replace successfully - add all the strings
            if len(split_list) > 1:
                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])
                # Add to new result
                new_result.extend(sub_result)
            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result


if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])
Here are the variable values from the example above:
initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'
As you can see, the lists contain tuples. Each tuple holds two values: a string and a boolean that says whether the string is text or a replaced value (True when text).
After you get the replaced list, you can modify it as in the example, checking whether each item is a text value (if x[1] == True).
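For comparison, the same list of (string, boolean) tuples can probably be built more compactly with re.split and a capturing group, since re.split then keeps the matched separators in its result. This is a sketch of a variant, not the code above: split_keep_words is a hypothetical helper name, and unlike get_replaced_list it scans the string once and tries longer words first.

```python
import re


def split_keep_words(string, words):
    # Hypothetical helper: returns (piece, is_text) tuples in original order.
    # Words are escaped so they match literally; longer words are tried first.
    pattern = "({})".format(
        "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True)))
    pieces = re.split(pattern, string)
    # With exactly one capturing group, re.split alternates:
    # even indices hold the text between matches, odd indices hold the matches.
    return [(piece, i % 2 == 0) for i, piece in enumerate(pieces)]


initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
replaced_list = split_keep_words(initial_string, ('a', 'c'))
# Wrap only the non-empty text pieces, as do_something_with_str does above.
final_string = ''.join(
    "@" + s if is_text and s else s for s, is_text in replaced_list)
print(final_string)
```

For this input the result matches the final_string shown above.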
Hope that helps!
P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6 or later.