Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Slow regex with python and javascript but fast to fail in go and php

I wrote a regex to parse PostgreSQL errors to try and show to the user which field had duplicated data. The regex is this one:

^DETAIL:.[^\(]+.(.[^\)]+).[^\(]+.(.[^\)]+). already exists

It will be pretty fast if you run it against the right message like this one (https://regex101.com/r/GZuREV/1):

ERROR:  duplicate key value violates unique constraint "uq_content_block_internal_name_store_id"
DETAIL:  Key (lower(internal_name::text), store_id)=(some content block-32067683, 0c6d20a7-d843-44f3-af9c-4a2cf2a47e4c) already exists.

But if PostgreSQL emits another message like the following one, it will take python about 30 seconds in my machine to answer (https://regex101.com/r/GZuREV/2).

ERROR:  null value in column "active" violates not-null constraint
DETAIL:  Failing row contains (2018-08-16 14:23:52.214591+00, 2018-08-16 14:23:52.214591+00, null, 6f6d1bc9-c47e-46f8-b220-dae49bd58090, bf24d26e-4871-4335-9f18-83c5a52f1b3a, Some Product-a1c03dde-2de9-401c-92d5-5c1500908984, {"de_DE": "Fugit tempore voluptas quos est vitae.", "en_GB": "Qu..., {"de_DE": "Fuga reprehenderit nobis reprehenderit natus magni es..., {"de_DE": "Fuga provident dolorum. Corrupti sunt in tempore quae..., my-product-53077578, SKU-53075778, 600, 4300dc25-04e2-4193-94c0-8ee97b636739, 52553d24-6d1c-4ce6-89f9-4ad765599040, null, 38089c3c-423f-430c-b211-ab7a57dbcc13, 7d7dc30e-b06b-48b7-b674-26d4f705583b, null, {}, 0, null, 9980, 100, 1, 5).

If go to the regex101 link you can see that if you toggle to different languages like php or go, they all return quite fast saying that no match was found but if you choose python or javascript you will get a timeout.

My quick dirty fix was something like this:

match = 'already exists' in error_message and compiled_regex.search(error_message)

What do you think that could be causing that? Could it be the greedy operators to consume until I reach my desired data?

Update 1

Using re.IGNORECASE in python, makes it around 9 seconds slower as it spends too much time lowercasing something.

With ignorecase

With ignorecase

Without ignorecase

Without ignorecase

Update 2

Playing around I saw that to make it work and fail a simple change will break

^DETAIL:.[^\(]+?\((.[^\)]+?).[^\(]+?.(.[^\)]+?). already exists
                            ^ just changing this to \) make it stop timing out
^DETAIL:.[^\(]+?\((.[^\)]+?)\)[^\(]+?.(.[^\)]+?). already exists
like image 574
Jayson Reis Avatar asked Mar 06 '23 14:03

Jayson Reis


1 Answers

Package regexp

import "regexp"

Package regexp implements regular expression search.

The syntax of the regular expressions accepted is the same general syntax used by Perl, Python, and other languages. More precisely, it is the syntax accepted by RE2 and described at https://golang.org/s/re2syntax, except for \C. For an overview of the syntax, run

go doc regexp/syntax

The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.) For more information about this property, see

http://swtch.com/~rsc/regexp/regexp1.html

or any book about automata theory.


By design, Go regular expressions are guaranteed to run in time linear in the size of the input, a property not guaranteed by some other implementations of regular expressions. See Regular Expression Matching Can Be Simple And Fast.

like image 178
peterSO Avatar answered Mar 09 '23 05:03

peterSO