Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to join tables on regex

Say I have two tables msg for messages and mnc for mobile network codes. They share no relations. But I want to join them

SELECT msg.message,
    msg.src_addr,
    msg.dst_addr,
    mnc.name,
FROM "msg"
JOIN "mnc"
ON array_to_string(regexp_matches(msg.src_addr || '+' || msg.dst_addr, '38(...)'), '') = mnc.code

But query fails with error:

psql:marketing.sql:28: ERROR:  argument of JOIN/ON must not return a set
LINE 12: ON array_to_string(regexp_matches(msg.src_addr || '+' || msg...

Is there a way to do such join? Or am I moving wrong way?

like image 343
z4y4ts Avatar asked Dec 30 '11 11:12

z4y4ts


1 Answers

A very odd way to join. Every match on one side is combined with every row from the other table ...

regexp_matches() is probably the wrong function for your purpose. You want a simple regular expression match (~). Actually, the LIKE operator will be faster:

Presumably fastest with LIKE

SELECT msg.message
     , msg.src_addr
     , msg.dst_addr
     , mnc.name
FROM   mnc
JOIN   msg ON msg.src_addr LIKE ('%38' || mnc.code || '%')
           OR msg.dst_addr LIKE ('%38' || mnc.code || '%')
WHERE  length(mnc.code) = 3;

In addition, you only want mnc.code of exactly 3 characters.

With regexp match

You could write the same with regular expressions but it will most definitely be slower. Here is a working example close to your original:

SELECT msg.message
     , msg.src_addr
     , msg.dst_addr
     , mnc.name
FROM   mnc
JOIN   msg ON (msg.src_addr || '+' || msg.dst_addr) ~ (38 || mnc.code)
           AND length(mnc.code) = 3;

This also requires msg.src_addr and msg.dst_addr to be NOT NULL.

The second query demonstrates how the additional check length(mnc.code) = 3 can go into the JOIN condition or a WHERE clause. Same effect here.

With regexp_matches()

You could make this work with regexp_matches():

SELECT msg.message
     , msg.src_addr
     , msg.dst_addr
     , mnc.name
FROM   mnc
JOIN   msg ON EXISTS (
    SELECT * 
    FROM   regexp_matches(msg.src_addr ||'+'|| msg.dst_addr, '38(...)', 'g') x(y)
    WHERE  y[1] = mnc.code
    );

But it will be slow in comparison.

Explanation:
Your regexp_matches() expression just returns an array of all captured substrings of the first match. As you only capture one substring (one pair of brackets in your pattern), you will exclusively get arrays with one element.

You get all matches with the additional "globally" switch 'g' - but in multiple rows. So you need a sub-select to test them all (or aggregate). Put that in an EXISTS - semi-join and you arrive at what you wanted.

Maybe you can report back with a performance test of all three? Use EXPLAIN ANALYZE for that.

like image 92
Erwin Brandstetter Avatar answered Oct 27 '22 20:10

Erwin Brandstetter