
Replacement using multiple regexes or a bigger one in Python

Tags: python, regex

I switched to Python fairly recently, and I want to clean up a large number of web pages (around 12k; they can just as easily be treated as plain text files) by removing particular tags and other string patterns. For this I'm using Python's re.sub(..) function.

My question is whether it is better, in terms of efficiency, to build one big regular expression that matches several of my patterns, or to call the function several times with smaller, simpler regular expressions.

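For example, is it better to use something like

 import re
 # re.sub takes (pattern, replacement, string); an empty replacement removes the match
 content = re.sub(r"<[^<>]*>", "", content)
 content = re.sub(r"some_other_pattern", "", content)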

or

 content = re.sub(r"<[^<>]*>|some_other_pattern", "", content)

Of course, the patterns above are deliberately simple for the sake of illustration, and I haven't compiled them here, but in my real-life scenario I will.

Edit: The question is not about the HTML nature of the files, but about how Python behaves when dealing with multiple regex patterns.

Thanks!

asked Sep 23 '12 23:09 by Cosmin SD


2 Answers

Keep it simple.

I would say you are safer using smaller regexes to work through this. That way, if something behaves abnormally, you don't have to dig through a massive regex to find which part of it is misbehaving. Provided you keep good logs of the replacements you make, it should be trivial to determine the source of a problem, should one arise.

You don't want to run into this
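The logging idea mentioned in this answer could be sketched roughly as follows. This is only an illustration, not the answerer's code: the rule names, the whitespace pattern, the `clean` helper and the file name `page1.html` are all hypothetical. It relies on `Pattern.subn`, which returns both the new string and the number of replacements made.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cleanup")

# Hypothetical rules: (label, compiled pattern, replacement).
RULES = [
    ("tags", re.compile(r"<[^<>]*>"), ""),
    ("spaces", re.compile(r"[ \t]{2,}"), " "),
]

def clean(content, source="<unknown>"):
    """Apply each small regex in turn and log how many matches it replaced."""
    counts = {}
    for label, pattern, repl in RULES:
        # subn returns (new_string, number_of_substitutions)
        content, n = pattern.subn(repl, content)
        counts[label] = n
        log.info("%s: rule %r made %d replacement(s)", source, label, n)
    return content, counts

cleaned, counts = clean("<p>Hello   <b>world</b></p>", source="page1.html")
print(cleaned)  # Hello world
print(counts)   # {'tags': 4, 'spaces': 1}
```

With one rule per pattern, an unexpected result can be traced to the rule whose count looks wrong.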

answered Oct 05 '22 22:10 by Tadgh


Generally speaking, "sequential" and "parallel" application are not the same and may produce different results, because sequential replacements can affect each other.
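A small sketch of how the two approaches can diverge. The second pattern (`spam`) and the sample text are made up for illustration. Removing the tags first creates a new occurrence of `spam` that the second pass then removes, whereas the single combined pass never sees it.

```python
import re

TAG = r"<[^<>]*>"
WORD = r"spam"  # hypothetical second pattern

text = "sp<b>am</b> and spam"

# Sequential: the first pass joins "sp" and "am" into a new "spam",
# which the second pass then removes as well.
step1 = re.sub(TAG, "", text)          # "spam and spam"
sequential = re.sub(WORD, "", step1)   # " and "

# Single pass: each position is matched once against the original text,
# so the "spam" created by removing the tags is never seen.
combined = re.sub(f"{TAG}|{WORD}", "", text)  # "spam and "

print(repr(sequential))
print(repr(combined))
```

Which result is the "right" one depends on the intent, so the choice is not purely about performance.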

As for performance, I would guess that one expression performs better, but that's just a guess. I personally prefer to keep them complex and use "verbose" mode for readability's sake.
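One way to test the guess is to measure it on representative data. Below is a minimal sketch, assuming a made-up second pattern (`&nbsp;`) and a synthetic sample; the helper names are hypothetical. The combined pattern is written with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. Timings will vary with the patterns, the input and the Python version, so no numbers are claimed here.

```python
import re
import timeit

# Combined pattern in verbose mode: whitespace is ignored, "#" starts a comment.
COMBINED = re.compile(
    r"""
      <[^<>]*>    # any tag-like span
    | &nbsp;      # hypothetical second pattern: a literal entity
    """,
    re.VERBOSE,
)

# The same patterns as separate, precompiled expressions.
SEPARATE = [re.compile(r"<[^<>]*>"), re.compile(r"&nbsp;")]

def clean_combined(content):
    return COMBINED.sub("", content)

def clean_separate(content):
    for pattern in SEPARATE:
        content = pattern.sub("", content)
    return content

def compare(sample, number=20):
    """Return (combined_seconds, separate_seconds) for `number` runs each."""
    t_combined = timeit.timeit(lambda: clean_combined(sample), number=number)
    t_separate = timeit.timeit(lambda: clean_separate(sample), number=number)
    return t_combined, t_separate

sample = "<p>one&nbsp;two</p>" * 1000  # synthetic input, not real page data
t_combined, t_separate = compare(sample)
print(f"combined: {t_combined:.4f}s  separate: {t_separate:.4f}s")
```

On this sample both functions return the same string, so the comparison is like for like; for patterns that can interact, the outputs should be compared before the timings.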

answered Oct 05 '22 21:10 by georg