Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL best practice: matching prefixes

Tags:

mysql

I have a table with codes and an other table with prefixes. I need to match the (longest) prefix for each code.

There is also a secondary scope in which I have to restrict prefixes (this involves bringing in other tables). I don't think this would matter in most cases, but here is a simplified (normalized) scheme (I have to set item.prefix_id):

group (id)
subgroup (id, group_id)
prefix (id, subgroup_id, prefix)
item (id, group_id, code, prefix_id)

It is allright to cache the length of the prefix in a new field and index it. It is allright to cache the group_id in prefix table (although groups are fairly small tables, in most cases I don't think any performance increase is gained). item table contains a few hundred thousand records, prefix contains at most 500.

Edit:

Sorry If the question was not defined enough. When using the word "prefix" I actually mean it, so the codes have to start with the actual prefix.

subgroup
id   group_id
-------------
1    1
2    1
3    1
4    2

prefix
id   subgroup_id  prefix
------------------------
1    1            a
2    2            abc
3    2            123
4    4            abcdef

item
id   group_id     code    prefix_id
-----------------------------------
1    1            abc123  NULL
2    1            abcdef  NULL
3    1            a123    NULL
4    2            abc123  NULL

The expected result for the prefix column is (item.id, item.prefix_id):

(1, 2) Because: subroups 1, 2, 3 are under group 1, the code abc123 starts with the the prefix a and the prefix abc and abc is the logest of the two, so we take the id of abc which is 2 and put it into item.prefix_id.

(2, 2) Because: even though prefix {4} (which is abcdef) is the logest matching prefix, it's subgroup (which is 4) is under group 2 but the item is under group 1, so we can choose from subgroups 1, 2, 3 and still abc is the logest match out of the three possible prefixes.

(3, 1) Because: a is the logest match.

(4, NULL) Because: item 4 is under group 2 and the only prefix under group 2 is abcdef which is no match to abc123 (because abc123 does not start with abcdef).

But as I said the whole groping thing is not essential part of the question. My main concern is to match a table with possible prefixes to a table of strings, and how to do it the best way. (Best meaning an optimal tradeoff between readability, maintainability and performance - hence the 'best prectice' in the title).

Currently I'm doing something like:

UPDATE item USE INDEX (code3)
    LEFT JOIN prefix ON prefix.length=3 AND LEFT(item.code,3)=prefix.prefix
    LEFT JOIN subgroup ON subgroup.id=prefix.subgroup_id
WHERE subgroup.group_id == item.group_id AND
    item.segment_id IS NULL

Where code3 is a KEY code3 (segment_id, group_id, code(3)). - And the same logic is repeate with 1, 2, 3 and 4 as length. It seems pretty efficient, but I don't like the presence of duplication in it (4 queries for a single operation). - of course this is in the case when the maximum legth of prefixes is 4.

Thanks for everyone for sharing your ideas this far.

like image 443
vbence Avatar asked Oct 11 '22 14:10

vbence


1 Answers

It is allright to cache the group_id in prefix table.

So let's create column group_id in table prefix and fill the column with the appropriate values. I assume that you know how to do this, so let's go to the next step.

The biggest performance benefit we will get from this composite index:

ALTER TABLE `prefix` ADD INDEX `c_index` (
    `group_id` ASC, 
    `prefix` ASC
);

And the UPDATE statement:

UPDATE item i
SET 
    prefix_id = (
        SELECT p.id
        FROM prefix p USE INDEX (`c_index`)
        WHERE 
            p.group_id = i.group_id AND 
            p.prefix IN (
                LEFT(i.code, 4), 
                LEFT(i.code, 3), 
                LEFT(i.code, 2), 
                LEFT(i.code, 1)
            )                
        ORDER BY LENGTH(p.prefix) DESC
        LIMIT 1        
    )

In this example I assume that prefix is variable length {1,4}. Together I decided to use IN clause instead of LIKE for to get the full benefit of c_index.

like image 68
Karolis Avatar answered Oct 20 '22 06:10

Karolis