Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get position of regexp match in string in PostgreSQL?

I have a table with book titles and I want to select books that have title matching a regexp and to order results by the position of the regexp match in title.

It's easy for a single-word searches. E.g.

TABLE book
id   title
1    The Sun
2    The Dead Sun
3    Sun Kissed

I'm going to put .* between words in client's search term before sending query to DB, so I'd write SQL with prepared regexps here.

SELECT book.id, book.title FROM book
    WHERE book.title ~* '.*sun.*'
    ORDER BY COALESCE(NULLIF(position('sun' in book.title), 0), 999999) ASC;

RESULT
id   title
3    Sun Kissed
1    The Sun
2    The Dead Sun

But if search term has more than one word I want to match titles that have all words from search term with anything between them, and sort by the position like before, so I need a function that returns a position of regexp, I didn't find an appropriate one in official PostgreSQL docs.

TABLE books
id   title
4    Deep Space Endeavor
5    Star Trek: Deep Space Nine: The Never Ending Sacrifice
6    Deep Black: Space Espionage and National Security

SELECT book.id, book.title FROM book
    WHERE book.title ~* '.*deep.*space.*'
    ORDER BY ???REGEXP_POSITION_FUNCTION???('.*deep.*space.*' in book.title);

DESIRED RESULT
id   title
4    Deep Space Endeavor
6    Deep Black: Space Espionage and National Security
5    Star Trek: Deep Space Nine: The Never Ending Sacrifice

I didn't find any function similar to ???REGEXP_POSITION_FUNCTION???, do you have any ideas?

like image 640
wobmene Avatar asked Jan 14 '14 01:01

wobmene


2 Answers

One way (of many) to do this: Remove the rest of the string beginning at the match and measure the length of the truncated string:

SELECT id, title
FROM   book
WHERE  title ILIKE '%deep%space%'
ORDER  BY length(regexp_replace(title, 'deep.*space.*', '','i'));

Using ILIKE in the WHERE clause, since that is typically faster (and does the same here).
Also note the fourth parameter to the regexp_replace() function ('i'), to make it case insensitive.

Alternatives

As per request in the comment.
At the same time demonstrating how to sort matches first (and NULLS LAST).

SELECT id, title
      ,substring(title FROM '(?i)(^.*)deep.*space.*') AS sub1
      ,length(substring(title FROM '(?i)(^.*)deep.*space.*')) AS pos1

      ,substring(title FROM '(?i)^.*(?=deep.*space.*)') AS sub2
      ,length(substring(title FROM '(?i)^.*(?=deep.*space.*)')) AS pos2

      ,substring(title FROM '(?i)^.*(deep.*space.*)') AS sub3
      ,position((substring(title FROM '(?i)^.*(deep.*space.*)')) IN title) AS p3

      ,regexp_replace(title, 'deep.*space.*', '','i') AS reg4
      ,length(regexp_replace(title, 'deep.*space.*', '','i')) AS pos4
FROM   book
ORDER  BY title ILIKE '%deep%space%' DESC NULLS LAST
         ,length(regexp_replace(title, 'deep.*space.*', '','i'));

You can find documentation for all of the above in the manual here and here.

-> SQLfiddle demonstrating all.

like image 85
Erwin Brandstetter Avatar answered Sep 25 '22 15:09

Erwin Brandstetter


Another way to do this would be to first get the literal match for the pattern, then find the position of the literal match:

strpos(input, (regexp_match(input, pattern, 'i'))[1]);

Or in this case:

SELECT   id, title
FROM     book
ORDER BY strpos(book.title, (regexp_match(book.title, '.*deep.*space.*', 'i'))[1]);

However, there are few caveats:

  1. this is not very efficient as it will scan the input string twice.

  2. this will ignore lookaround (lookbehind, lookahead) constraints, since the literal match can appear multiple times, before the pattern match. e.g: for the input 'aba' and pattern '(?<=b)a', strpos will return 1 (for the 1st 'a') although the actual position should be 3 (for the 2nd 'a').

BTW, you should probably use a greedy quantifier and narrow your character class as much as you can instead of .* to increase performance (e.g 'deep [\w\s]*? space')

like image 38
gilad905 Avatar answered Sep 25 '22 15:09

gilad905