Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient method determining if a list of values completely satisfy a one to many relationship (MySQL)

I have a one-to-many relationship of rooms and their occupants:

Room | User
1    | 1
1    | 2
1    | 4
2    | 1
2    | 2
2    | 3
2    | 5
3    | 1
3    | 3

Given a list of users, e.g. 1, 3, what is the most efficient way to determining which room is completely/perfectly filled by them? So in this case, it should return room 3 because, although they are both in room 2, room 2 has other occupants as well, which is not a "perfect" fit.

I can think of several solutions to this, but am not sure about the efficiency. For example, I can do a group concatenate on the user (ordered ascending) grouping by room, which will give me comma separated strings such as "1,2,4", "1,2,3,5" and "1,3". I can then order my input list ascending and look for a perfect match to "1,3".

Or I can do a count of the total number of users in a room AND containing both users 1 and 3. I will then select the room which has the count of users equal to two.

Note I want to most efficient way, or at least a way that scales up to millions of users and rooms. Each room will have around 25 users. Another thing I want to consider is how to pass this list to the database. Should I construct a query by concatenating AND userid = 1 AND userid = 3 AND userid = 5 and so on? Or is there a way to pass the values as an array into a stored procedure?

Any help would be appreciated.

like image 853
l3utterfly Avatar asked Oct 31 '22 06:10

l3utterfly


1 Answers

For example, I can do a group concatenate on the user (ordered ascending) grouping by room, which will give me comma separated strings such as "1,2,4", "1,2,3,5" and "1,3". I can then order my input list ascending and look for a perfect match to "1,3".

First, a word of advice, to improve your level of function as a developer. Stop thinking of the data, and of the solution, in terms of CSVs. It limits you to thinking in spreadsheet terms, and prevents you from thinking in Relational Data terms. You do not need to construct strings, and then match strings, when the data is in the database, you can match it there.

Solution

Now then, in Relational data terms, what exactly do you want ? You want the rooms where the count of users that match your argument user list is highest. Is that correct ? If so, the code is simple.

You haven't given the tables. I will assume room, user, room_user, with deadly ids on the first two, and a composite key on the third. I can give you the SQL solution, you will have to work out how to do it in the non-SQL.

Another thing I want to consider is how to pass this list to the database. Should I construct a query by concatenating AND userid = 1 AND userid = 3 AND userid = 5 and so on? Or is there a way to pass the values as an array into a stored procedure?

  1. To pass the list to the stored proc, because it needs a single calling parm, the length of which is variable, you have to create a CSV list of users. Let's call that parm @user_list. (Note, that is not contemplating the data, that is passing a list to a proc in a single parm, because you can't pass an unknown number of identified users to a proc otherwise.)

  2. Since you constructed the @user_list on the client, you may as well compute @user_count (the number of members in the list) while you are at it, on the client, and pass that to the proc.

Something like:

CREATE PROC room_user_match_sp (
    @user_list    CHAR(255),
    @user_count   INT
    ...
    )
AS
    -- validate parms, etc
    ...
SELECT  room_id,
        match_count,
        match_count / @user_count * 100 AS match_pct
    FROM  (
        SELECT  room_id,
                COUNT(user_id) AS match_count -- no of users matched
            FROM room_user
            WHERE user_id IN ( @user_list )
            GROUP BY room_id                  -- get one row per room
            ) AS match_room                   -- has any matched users
    WHERE match_count = MAX( match_count )    -- remove this while testing

It is not clear, if you want full matches only. In that case, use:

    WHERE match_count = @user_count

Expectation

You have asked for a proc-based solution, so I have given that. Yes, it is the fastest. But keep in mind that for this kind of requirement and solution, you could construct the SQL string on the client, and execute it on the "server" in the usual manner, without using a proc. The proc is faster here only because the code is compiled and that step is removed, as opposed to that step being performed every time the client calls the "server" with the SQL string.

The point I am making here is, with the data in a reasonably Relational form, you can obtain the result you are seeking using a single SELECT statement, you don't have to mess around with work tables or temp tables or intermediate steps, which requires a proc. Here, the proc is not required, you are implementing a proc for performance reasons.

I make this point because it is clear from your question that your expectation of the solution is "gee, I can't get the result directly, I have work with the data first, I am ready and willing to do that". Such intermediate work steps are required only when the data is not Relational.

like image 169
PerformanceDBA Avatar answered Nov 15 '22 06:11

PerformanceDBA