Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Escape function for regular expression or LIKE patterns

To forgo reading the entire problem, my basic question is:
Is there a function in PostgreSQL to escape regular expression characters in a string?

I've probed the documentation but was unable to find such a function.

Here is the full problem:

In a PostgreSQL database, I have a column with unique names in it. I also have a process which periodically inserts names into this field, and, to prevent duplicates, if it needs to enter a name that already exists, it appends a space and parentheses with a count to the end.

i.e. Name, Name (1), Name (2), Name (3), etc.

As it stands, I use the following code to find the next number to add in the series (written in plpgsql):

var_name_id := 1;

SELECT CAST(substring(a.name from E'\\((\\d+)\\)$') AS int)
INTO var_last_name_id
FROM my_table.names a
WHERE a.name LIKE var_name || ' (%)'
ORDER BY CAST(substring(a.name from E'\\((\\d+)\\)$') AS int) DESC
LIMIT 1;

IF var_last_name_id IS NOT NULL THEN
    var_name_id = var_last_name_id + 1;
END IF;

var_new_name := var_name || ' (' || var_name_id || ')';

(var_name contains the name I'm trying to insert.)

This works for now, but the problem lies in the WHERE statement:

WHERE a.name LIKE var_name || ' (%)'

This check doesn't verify that the % in question is a number, and it doesn't account for multiple parentheses, as in something like "Name ((1))", and if either case existed a cast exception would be thrown.

The WHERE statement really needs to be something more like:

WHERE a.r1_name ~* var_name || E' \\(\\d+\\)'

But var_name could contain regular expression characters, which leads to the question above: Is there a function in PostgreSQL that escapes regular expression characters in a string, so I could do something like:

WHERE a.r1_name ~* regex_escape(var_name) || E' \\(\\d+\\)'

Any suggestions are much appreciated, including a possible reworking of my duplicate name solution.

like image 762
Benny Avatar asked Feb 28 '11 15:02

Benny


People also ask

How do you escape a pattern in regex?

The \ is known as the escape code, which restore the original literal meaning of the following character. Similarly, * , + , ? (occurrence indicators), ^ , $ (position anchors) have special meaning in regex. You need to use an escape code to match with these characters.

Should you escape in regex?

In order to use a literal ^ at the start or a literal $ at the end of a regex, the character must be escaped. Some flavors only use ^ and $ as metacharacters when they are at the start or end of the regex respectively. In those flavors, no additional escaping is necessary. It's usually just best to escape them anyway.

What does escape mean in regex?

Now, escaping a string (in regex terms) means finding all of the characters with special meaning and putting a backslash in front of them, including in front of other backslash characters. When you've done this one time on the string, you have officially "escaped the string".

What does ?= Mean in regular expression?

?= is a positive lookahead, a type of zero-width assertion. What it's saying is that the captured match must be followed by whatever is within the parentheses but that part isn't captured. Your example means the match needs to be followed by zero or more characters and then a digit (but again that part isn't captured).


1 Answers

To address the question at the top:

Assuming standard_conforming_strings = on, like it's default since Postgres 9.1.

Regular expression escape function

Let's start with a complete list of characters with special meaning in regular expression patterns:

!$()*+.:<=>?[\]^{|}-

Wrapped in a bracket expression most of them lose their special meaning - with a few exceptions:

  • - needs to be first or last or it signifies a range of characters.
  • ] and \ have to be escaped with \ (in the replacement, too).

After adding capturing parentheses for the back reference below we get this regexp pattern:

([!$()*+.:<=>?[\\\]^{|}-])

Using it, this function escapes all special characters with a backslash (\) - thereby removing the special meaning:

CREATE OR REPLACE FUNCTION f_regexp_escape(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT regexp_replace($1, '([!$()*+.:<=>?[\\\]^{|}-])', '\\\1', 'g')
$func$;

Add PARALLEL SAFE (because it is) in Postgres 10 or later to allow parallelism for queries using it.

Demo

SELECT f_regexp_escape('test(1) > Foo*');

Returns:

test\(1\) \> Foo\*

And while:

SELECT 'test(1) > Foo*' ~ 'test(1) > Foo*';

returns FALSE, which may come as a surprise to naive users,

SELECT 'test(1) > Foo*' ~ f_regexp_escape('test(1) > Foo*');

Returns TRUE as it should now.

LIKE escape function

For completeness, the pendant for LIKE patterns, where only three characters are special:

\%_

The manual:

The default escape character is the backslash but a different one can be selected by using the ESCAPE clause.

This function assumes the default:

CREATE OR REPLACE FUNCTION f_like_escape(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT replace(replace(replace($1
         , '\', '\\')  -- must come 1st
         , '%', '\%')
         , '_', '\_');
$func$;

We could use the more elegant regexp_replace() here, too, but for the few characters, a cascade of replace() functions is faster.

Again, PARALLEL SAFE in Postgres 10 or later.

Demo

SELECT f_like_escape('20% \ 50% low_prices');

Returns:

20\% \\ 50\% low\_prices
like image 164
Erwin Brandstetter Avatar answered Sep 24 '22 16:09

Erwin Brandstetter