I have a table with book titles and I want to select books that have title matching a regexp and to order results by the position of the regexp match in title.
It's easy for a single-word searches. E.g.
TABLE book
id title
1 The Sun
2 The Dead Sun
3 Sun Kissed
I'm going to put .*
between words in client's search term before sending query to DB, so I'd write SQL with prepared regexps here.
SELECT book.id, book.title FROM book
WHERE book.title ~* '.*sun.*'
ORDER BY COALESCE(NULLIF(position('sun' in book.title), 0), 999999) ASC;
RESULT
id title
3 Sun Kissed
1 The Sun
2 The Dead Sun
But if search term has more than one word I want to match titles that have all words from search term with anything between them, and sort by the position like before, so I need a function that returns a position of regexp, I didn't find an appropriate one in official PostgreSQL docs.
TABLE books
id title
4 Deep Space Endeavor
5 Star Trek: Deep Space Nine: The Never Ending Sacrifice
6 Deep Black: Space Espionage and National Security
SELECT book.id, book.title FROM book
WHERE book.title ~* '.*deep.*space.*'
ORDER BY ???REGEXP_POSITION_FUNCTION???('.*deep.*space.*' in book.title);
DESIRED RESULT
id title
4 Deep Space Endeavor
6 Deep Black: Space Espionage and National Security
5 Star Trek: Deep Space Nine: The Never Ending Sacrifice
I didn't find any function similar to ???REGEXP_POSITION_FUNCTION???, do you have any ideas?
One way (of many) to do this: Remove the rest of the string beginning at the match and measure the length of the truncated string:
SELECT id, title
FROM book
WHERE title ILIKE '%deep%space%'
ORDER BY length(regexp_replace(title, 'deep.*space.*', '','i'));
Using ILIKE
in the WHERE clause, since that is typically faster (and does the same here).
Also note the fourth parameter to the regexp_replace()
function ('i'
), to make it case insensitive.
As per request in the comment.
At the same time demonstrating how to sort matches first (and NULLS LAST
).
SELECT id, title
,substring(title FROM '(?i)(^.*)deep.*space.*') AS sub1
,length(substring(title FROM '(?i)(^.*)deep.*space.*')) AS pos1
,substring(title FROM '(?i)^.*(?=deep.*space.*)') AS sub2
,length(substring(title FROM '(?i)^.*(?=deep.*space.*)')) AS pos2
,substring(title FROM '(?i)^.*(deep.*space.*)') AS sub3
,position((substring(title FROM '(?i)^.*(deep.*space.*)')) IN title) AS p3
,regexp_replace(title, 'deep.*space.*', '','i') AS reg4
,length(regexp_replace(title, 'deep.*space.*', '','i')) AS pos4
FROM book
ORDER BY title ILIKE '%deep%space%' DESC NULLS LAST
,length(regexp_replace(title, 'deep.*space.*', '','i'));
You can find documentation for all of the above in the manual here and here.
-> SQLfiddle demonstrating all.
Another way to do this would be to first get the literal match for the pattern, then find the position of the literal match:
strpos(input, (regexp_match(input, pattern, 'i'))[1]);
Or in this case:
SELECT id, title
FROM book
ORDER BY strpos(book.title, (regexp_match(book.title, '.*deep.*space.*', 'i'))[1]);
However, there are few caveats:
this is not very efficient as it will scan the input string twice.
this will ignore lookaround (lookbehind, lookahead) constraints, since the literal match can appear multiple times, before the pattern match.
e.g: for the input 'aba'
and pattern '(?<=b)a'
, strpos
will return 1
(for the 1st 'a'
) although the actual position should be 3
(for the 2nd 'a'
).
BTW, you should probably use a greedy quantifier and narrow your character class as much as you can instead of .*
to increase performance (e.g 'deep [\w\s]*? space'
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With