I can use this code below to create a new file with the substitution of a
with aa
using regular expressions.
import re with open("notes.txt") as text: new_text = re.sub("a", "aa", text.read()) with open("notes2.txt", "w") as result: result.write(new_text)
I was wondering do I have to use this line, new_text = re.sub("a", "aa", text.read())
, multiple times but substitute the string for others letters that I want to change in order to change more than one letter in my text?
That is, so a
-->aa
,b
--> bb
and c
--> cc
.
So I have to write that line for all the letters I want to change or is there an easier way. Perhaps to create a "dictionary" of translations. Should I put those letters into an array? I'm not sure how to call on them if I do.
To perform a substitution, you use the Replace method of the Regex class, instead of the Match method that we've seen in earlier articles. This method is similar to Match, except that it includes an extra string parameter to receive the replacement value.
sub() method will replace all pattern occurrences in the target string. By setting the count=1 inside a re. sub() we can replace only the first occurrence of a pattern in the target string with another string. Set the count value to the number of replacements you want to perform.
made this to find all with multiple #regular #expressions. regex1 = r"your regex here" regex2 = r"your regex here" regex3 = r"your regex here" regexList = [regex1, regex1, regex3] for x in regexList: if re. findall(x, your string): some_list = re. findall(x, your string) for y in some_list: found_regex_list.
Put a capture group around the part that you want to preserve, and then include a reference to that capture group within your replacement text. @Amber: I infer from your answer that unlike str. replace(), we can't use variables a) in raw strings; or b) as an argument to re. sub; or c) both.
The answer proposed by @nhahtdh is valid, but I would argue less pythonic than the canonical example, which uses code less opaque than his regex manipulations and takes advantage of python's built-in data structures and anonymous function feature.
A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ )
import re def multiple_replace(dict, text): # Create a regular expression from the dictionary keys regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys()))) # For each match, look-up corresponding value in dictionary return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) if __name__ == "__main__": text = "Larry Wall is the creator of Perl" dict = { "Larry Wall" : "Guido van Rossum", "creator" : "Benevolent Dictator for Life", "Perl" : "Python", } print multiple_replace(dict, text)
So in your case, you could make a dict trans = {"a": "aa", "b": "bb"}
and then pass it into multiple_replace
along with the text you want translated. Basically all that function is doing is creating one huge regex containing all of your regexes to translate, then when one is found, passing a lambda function to regex.sub
to perform the translation dictionary lookup.
You could use this function while reading from your file, for example:
with open("notes.txt") as text: new_text = multiple_replace(replacements, text.read()) with open("notes2.txt", "w") as result: result.write(new_text)
I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.
As @nhahtdh pointed out, one downside to this approach is that it is not prefix-free: dictionary keys that are prefixes of other dictionary keys will cause the method to break.
You can use capturing group and backreference:
re.sub(r"([characters])", r"\1\1", text.read())
Put characters that you want to double up in between []
. For the case of lower case a
, b
, c
:
re.sub(r"([abc])", r"\1\1", text.read())
In the replacement string, you can refer to whatever matched by a capturing group ()
with \n
notation where n
is some positive integer (0 excluded). \1
refers to the first capturing group. There is another notation \g<n>
where n
can be any non-negative integer (0 allowed); \g<0>
will refer to the whole text matched by the expression.
If you want to double up all characters except new line:
re.sub(r"(.)", r"\1\1", text.read())
If you want to double up all characters (new line included):
re.sub(r"(.)", r"\1\1", text.read(), 0, re.S)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With