Is it safe to use user input for Python's regular expressions?

I would like to let my users use regular expressions for some features. I'm curious what the implications are of passing user input to re.compile(). I assume there is no way for a user to give me a string that could let them execute arbitrary code. The dangers I have thought of are:

1. The user could pass input that raises an exception.
2. The user could pass input that causes the regex engine to take a long time, or to use a lot of memory.

The solution to 1. is easy: catch exceptions. I'm not sure if there is a good solution to 2. Perhaps just limiting the length of the regex would work.
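Catching the exception for point 1 might look like the sketch below. The helper name `compile_user_pattern` is made up for illustration. Besides `re.error`, it also catches `OverflowError` and `RecursionError`, on the assumption that extreme repeat counts or very deep nesting can surface as those instead.

```python
import re


def compile_user_pattern(pattern):
    """Return a compiled pattern, or None if the user's input cannot be compiled.

    compile_user_pattern is a hypothetical helper name, not part of the re module.
    """
    try:
        return re.compile(pattern)
    except (re.error, OverflowError, RecursionError):
        # re.error covers ordinary syntax errors such as an unbalanced "(".
        # The other two are caught defensively for oversized or deeply
        # nested patterns.
        return None


if __name__ == "__main__":
    for candidate in ["a+b", "(", "[z-a]"]:
        compiled = compile_user_pattern(candidate)
        print(repr(candidate), "->", "ok" if compiled else "rejected")
```

This only addresses invalid patterns. A pattern that compiles cleanly can still be extremely slow to match.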

Is there anything else I need to worry about?

270

asked Jan 04 '10 07:01

Skeletron

2 Answers

I have worked on a program that allows users to enter their own regexes, and you are right: they can (and do) enter regexes that take a long time to finish, sometimes longer than the lifetime of the universe. What is worse, Python holds the GIL while processing a regex, so it hangs not only the thread that is running the regex but the entire program.

Limiting the length of the regex will not work, since the problem is backtracking. For example, matching the regex r"(\S+)+x" against a string of length N that does not contain an "x" will backtrack 2**N times. On my system this takes about a second to match against "a"*21, and the time doubles for each additional character, so a string of 100 characters would take approximately 19167393131891000 years to complete (this is an estimate; I have not timed it).
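The doubling can be observed with a small timing script like the sketch below. The function name `time_failed_match` and the range of N are arbitrary choices for illustration, and the absolute timings depend on the machine and Python version. N is kept small so the script finishes quickly.

```python
import re
import time

# The pattern from the example above: a nested quantifier followed by a
# character that never appears in the input.
PATTERN = re.compile(r"(\S+)+x")


def time_failed_match(n):
    """Match PATTERN against "a" * n and return (match_result, seconds_taken)."""
    text = "a" * n
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Each extra character should roughly double the time taken.
    for n in range(12, 21):
        _, elapsed = time_failed_match(n)
        print(f"N={n:2d}  {elapsed:.4f}s")
```

Note that the pattern itself is only eight characters long, which is why a length limit on the regex does not help.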

For more information read the O'Reilly book "Mastering Regular Expressions" - this has a couple of chapters on performance.

Edit: To get around this, we wrote a regex-analysing function that tried to catch and reject some of the more obvious degenerate cases, but it is impossible to catch all of them.

Another thing we looked at was patching the re module to raise an exception if it backtracks too many times. This is possible, but it requires changing the Python C source and recompiling, so it is not portable. We also submitted a patch to release the GIL when matching against Python strings, but I don't think it was accepted into the core (Python only holds the GIL because a regex can be run against mutable buffers).
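A different mitigation, not one of the approaches described above, is to run the match in a separate process and kill it if it takes too long. The sketch below is one way to do that with the standard library. The names `search_with_timeout` and `_CHILD_SCRIPT` are hypothetical, and passing the pattern and text as command-line arguments is a simplification that only suits short inputs.

```python
import subprocess
import sys

# Script run in the child interpreter. Exit code 0 means "matched",
# 1 means "no match", 2 means "the pattern did not compile".
_CHILD_SCRIPT = """
import re, sys
try:
    sys.exit(0 if re.search(sys.argv[1], sys.argv[2]) else 1)
except re.error:
    sys.exit(2)
"""


def search_with_timeout(pattern, text, timeout=1.0):
    """Return True/False for a match, or raise TimeoutError if it runs too long.

    Hypothetical helper: the regex runs in a child process, so a runaway
    match cannot hold the GIL of the calling program.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD_SCRIPT, pattern, text],
            timeout=timeout,  # subprocess.run kills the child on timeout
        )
    except subprocess.TimeoutExpired:
        raise TimeoutError(f"regex did not finish within {timeout} seconds")
    if proc.returncode == 2:
        raise ValueError("invalid regular expression")
    if proc.returncode not in (0, 1):
        raise RuntimeError(f"child exited with code {proc.returncode}")
    return proc.returncode == 0


if __name__ == "__main__":
    print(search_with_timeout(r"a+b", "xaaab", timeout=10))
    try:
        search_with_timeout(r"(\S+)+x", "a" * 60, timeout=1)
    except TimeoutError as exc:
        print("gave up:", exc)
```

Starting an interpreter per match is expensive, so this trades throughput for isolation. A long-lived worker process would be the natural next step if the cost matters.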

129

answered Sep 21 '22 05:09

Dave Kirby

For casual users, it's much simpler to give them a subset language. The shell's globbing rules, as implemented in fnmatch, are one example. The SQL LIKE condition rules are another.

Translate the user's language into a proper regex for execution at runtime.
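As a rough sketch of that translation step: `fnmatch.translate` from the standard library turns a shell glob into a regex string, and a LIKE pattern can be converted by hand. The helper names `glob_to_regex` and `like_to_regex` are made up for this example, and the LIKE converter assumes only the `%` and `_` wildcards with no escape character.

```python
import fnmatch
import re


def glob_to_regex(glob_pattern):
    """Compile a shell-style glob (*, ?, [seq]) into a regex object."""
    return re.compile(fnmatch.translate(glob_pattern))


def like_to_regex(like_pattern):
    """Compile a SQL LIKE pattern into a regex object.

    Assumes '%' means any run of characters and '_' means exactly one
    character; every other character is matched literally.
    """
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            # re.escape neutralises any regex metacharacters the user typed.
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


if __name__ == "__main__":
    print(bool(glob_to_regex("*.txt").match("notes.txt")))
    print(bool(like_to_regex("report_%").fullmatch("report1-final")))
```

Because the user never writes regex syntax directly, they cannot enter nested quantifiers such as `(\S+)+`.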

answered Sep 19 '22 05:09

S.Lott

Tags: python, regex, user-input, sanitize
