
How can I do multiple substitutions using regex?

I can use the code below to create a new file in which every a is replaced with aa, using regular expressions.

import re

with open("notes.txt") as text:
    new_text = re.sub("a", "aa", text.read())
    with open("notes2.txt", "w") as result:
        result.write(new_text)

I was wondering: do I have to repeat this line, new_text = re.sub("a", "aa", text.read()), once for each letter I want to change, substituting a different letter each time, in order to change more than one letter in my text?

That is, a --> aa, b --> bb and c --> cc.

So do I have to write that line for every letter I want to change, or is there an easier way? Perhaps I could create a "dictionary" of translations. Should I put those letters into an array? I'm not sure how to look them up if I do.

Euridice01 asked Mar 02 '13 13:03




2 Answers

The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which is less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous functions.

A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState: http://code.activestate.com/recipes/81330-single-pass-multiple-replace/):

import re

def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

if __name__ == "__main__":
    text = "Larry Wall is the creator of Perl"

    dict = {
        "Larry Wall": "Guido van Rossum",
        "creator": "Benevolent Dictator for Life",
        "Perl": "Python",
    }

    # print() with a single argument works on both Python 2 and Python 3
    print(multiple_replace(dict, text))

So in your case, you could make a dict replacements = {"a": "aa", "b": "bb", "c": "cc"} and then pass it into multiple_replace along with the text you want translated. All the function does is build one large regex that is an alternation of all the (escaped) keys you want to translate; regex.sub then calls the lambda for each match, and the lambda looks up the replacement in the dictionary.

You could use this function while reading from your file, for example:

with open("notes.txt") as text:
    new_text = multiple_replace(replacements, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)
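For the mapping in the question (a --> aa, b --> bb, c --> cc), the same idea can be sketched on an in-memory string, without touching any files. The names double_letters and replacements are illustrative and not part of the recipe, and mo.group(0) is used here as a shorter equivalent of mo.string[mo.start():mo.end()]:

```python
import re

# Hypothetical mapping for the question's case; extend it as needed.
replacements = {"a": "aa", "b": "bb", "c": "cc"}

def double_letters(mapping, text):
    # Build one alternation of the escaped keys, e.g. "a|b|c".
    pattern = re.compile("|".join(map(re.escape, mapping)))
    # For each match, look up the matched text in the mapping.
    return pattern.sub(lambda mo: mapping[mo.group(0)], text)

result = double_letters(replacements, "abcd cab")
print(result)  # aabbccd ccaabb
```

Characters that are not keys in the mapping (the d and the space here) pass through unchanged.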

I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.

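As @nhahtdh pointed out, one downside to this approach is that it assumes the dictionary keys are prefix-free. Python's regex alternation tries the alternatives from left to right and takes the first one that matches, so if one key is a prefix of another and appears earlier in the pattern, the longer key is never applied.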
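One possible workaround, which is not part of the cookbook recipe, is to sort the keys longest-first before joining them, so that a short key cannot shadow a longer key that starts with it. This is only a sketch; multiple_replace_longest_first and the sample mapping are hypothetical names chosen for illustration:

```python
import re

def multiple_replace_longest_first(mapping, text):
    # Put longer keys first so "ab" is tried before its prefix "a".
    keys = sorted(mapping, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    # Look up each matched key in the mapping.
    return regex.sub(lambda mo: mapping[mo.group(0)], text)

# "a" is a prefix of "ab", which is the problematic case.
mapping = {"a": "1", "ab": "2"}
print(multiple_replace_longest_first(mapping, "ab a"))  # 2 1
```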

Emmett Butler answered Oct 01 '22 21:10



You can use a capturing group and a backreference:

re.sub(r"([characters])", r"\1\1", text.read()) 

Put the characters that you want to double up between the [ and ]. For lowercase a, b and c:

re.sub(r"([abc])", r"\1\1", text.read()) 

In the replacement string, you can refer to whatever was matched by a capturing group () with the \n notation, where n is a positive integer (0 excluded). \1 refers to the first capturing group. There is another notation, \g<n>, where n can be any non-negative integer (0 allowed); \g<0> refers to the whole text matched by the expression.
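As a small illustration of the three notations, here is a sketch that uses an in-memory string in place of text.read(); the variable names are arbitrary:

```python
import re

text = "abcd"

# \1 refers to the first capturing group.
doubled = re.sub(r"([abc])", r"\1\1", text)

# \g<1> is the same reference written in the \g<n> form.
doubled_g = re.sub(r"([abc])", r"\g<1>\g<1>", text)

# \g<0> refers to the whole match, so no capturing group is needed.
doubled_whole = re.sub(r"[abc]", r"\g<0>\g<0>", text)

print(doubled, doubled_g, doubled_whole)  # aabbccd aabbccd aabbccd
```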


If you want to double up all characters except the newline:

re.sub(r"(.)", r"\1\1", text.read()) 

If you want to double up all characters (newline included):

re.sub(r"(.)", r"\1\1", text.read(), flags=re.S)
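The difference between the last two calls can be seen on a short string that contains a newline. This is a sketch with a made-up sample value; re.S is the short name for re.DOTALL:

```python
import re

sample = "ab\ncd"

# Without re.S, "." does not match the newline, so it stays single.
no_newline = re.sub(r"(.)", r"\1\1", sample)

# With re.S (re.DOTALL), "." also matches the newline, so it is doubled too.
with_newline = re.sub(r"(.)", r"\1\1", sample, flags=re.S)

print(repr(no_newline))    # 'aabb\nccdd'
print(repr(with_newline))  # 'aabb\n\nccdd'
```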
nhahtdh answered Oct 01 '22 19:10
