
What is SQL to select a property and the max number of occurrences of a related property?

Tags:

mysql

I have a table like this:

Table: p
+---------+------+
| id      | w_id |
+---------+------+
| 5       |  8   |
| 5       | 10   |
| 5       |  8   |
| 5       | 10   |
| 5       |  8   |
| 6       |  5   |
| 6       |  8   |
| 6       | 10   |
| 6       | 10   |
| 7       |  8   |
| 7       | 10   |
+---------+------+

What is the best SQL to get the following result?

+---------+-------------------+
| id      | most_used_w_id    |
+---------+-------------------+
|  5      |  8                |
|  6      | 10                |
|  7      |  8                |
+---------+-------------------+

In other words, I want to get, per id, the most frequent related w_id. Note that in the example above, id 7 is related to 8 once and to 10 once, so either (7, 8) or (7, 10) will do as a result. If it is not possible to pick just one, then having both (7, 8) and (7, 10) in the result set is fine.

I have come up with something like:

select counters2.p_id as id, counters2.w_id as most_used_w_id
from (
  select p.id as p_id, 
         w_id,
         count(w_id) as count_of_w_ids
  from p
  group by id, w_id
) as counters2

join (
  select p_id, max(count_of_w_ids) as max_counter_for_w_ids
  from (
    select p.id as p_id, 
           w_id,
           count(w_id) as count_of_w_ids
    from p
    group by id, w_id
  ) as counters
  group by p_id
 ) as p_max 

on p_max.p_id = counters2.p_id
   and p_max.max_counter_for_w_ids = counters2.count_of_w_ids
;

But I am not sure whether this is the best way to do it, and I had to repeat the same subquery twice.

Any better solution?

p.matsinopoulos Avatar asked Feb 27 '14 13:02



2 Answers

Try using user-defined variables:

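-- Flag the first row of each id after sorting by frequency, then keep only flagged rows.
-- Note: MySQL does not guarantee the evaluation order of expressions that use
-- user variables, so the result can depend on the server version.
select id, w_id
from (
  select T.*,
         if(@id <> id, 1, 0) as is_first,  -- 1 when id differs from the previous row
         @id := id as prev_id              -- remember this row's id for the next row
  from (
         select id, w_id, count(*) as cnt from p group by id, w_id
       ) as T,
       (select @id := 0) as T1             -- initialise @id; assumes no id equals 0
  order by id, cnt desc
) as T2
where is_first = 1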

valex Avatar answered Oct 11 '22 06:10



Formal SQL

In fact, your solution is correct in terms of standard SQL. Why? Because you have to join values from the original data to the grouped data, so your query cannot be simplified. MySQL allows mixing non-grouped columns with aggregate functions, but the value it returns for such a column is not guaranteed, so I do not recommend relying on that behavior.
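If the repetition is the main concern, and assuming MySQL 8.0 or later (which added common table expressions), the grouped subquery can be written once and referenced twice. This is only a sketch: the join logic is the same as in the question, the aliases `counters`, `c` and `p_max` follow the question's naming, and the table `p(id, w_id)` is taken from the question.

```sql
-- Assumes MySQL 8.0+ and the table p(id, w_id) from the question.
WITH counters AS (
  -- One row per (id, w_id) pair with its number of occurrences.
  SELECT id AS p_id, w_id, COUNT(*) AS count_of_w_ids
  FROM p
  GROUP BY id, w_id
)
SELECT c.p_id AS id, c.w_id AS most_used_w_id
FROM counters AS c
JOIN (
  -- Highest occurrence count per id.
  SELECT p_id, MAX(count_of_w_ids) AS max_counter_for_w_ids
  FROM counters
  GROUP BY p_id
) AS p_max
  ON p_max.p_id = c.p_id
 AND p_max.max_counter_for_w_ids = c.count_of_w_ids;
```

As with the original query, ties are all returned, so for the sample data both (7, 8) and (7, 10) appear.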

MySQL

Since you're using MySQL, you can use variables. I'm not a big fan of them, but in your case they can simplify things:

SELECT 
  c.*, 
  IF(@id!=id, @i:=1, @i:=@i+1) AS num, 
  @id:=id AS gid 
FROM 
  (SELECT id, w_id, COUNT(w_id) AS w_count 
  FROM p 
  GROUP BY id, w_id 
  ORDER BY id DESC, w_count DESC) AS c
  CROSS JOIN (SELECT @i:=-1, @id:=-1) AS init
HAVING 
  num=1;

For your data, the result will look like this:

+------+------+---------+------+------+
| id   | w_id | w_count | num  | gid  |
+------+------+---------+------+------+
|    7 |    8 |       1 |    1 |    7 |
|    6 |   10 |       2 |    1 |    6 |
|    5 |    8 |       3 |    1 |    5 |
+------+------+---------+------+------+

Thus, you've found each id and its corresponding w_id. The idea is to enumerate the rows within each id, relying on the fact that the subquery orders them by count. We then need only the first row of each id, because it holds the highest count.

This may be replaced with a single GROUP BY id, but, again, the server is free to choose any row in that case (in practice it works because the server takes the first row, but the documentation does not guarantee that in the general case).

One nice thing about this approach is its flexibility: you can select, for example, the second or third most frequent w_id by changing the value that num is compared to.
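The same enumerate-then-filter idea can also be sketched without variables, assuming MySQL 8.0 or later (which added window functions). This swaps the variable counter for `ROW_NUMBER()`; the alias `ranked` and the tie-break on `w_id` are my own choices for illustration, not part of the query above.

```sql
-- Assumes MySQL 8.0+ and the table p(id, w_id) from the question.
SELECT id, w_id AS most_used_w_id
FROM (
  SELECT c.id,
         c.w_id,
         -- Number the rows of each id from most to least frequent;
         -- ties are broken by the smaller w_id so the result is deterministic.
         ROW_NUMBER() OVER (PARTITION BY c.id
                            ORDER BY c.w_count DESC, c.w_id) AS num
  FROM (
    SELECT id, w_id, COUNT(*) AS w_count
    FROM p
    GROUP BY id, w_id
  ) AS c
) AS ranked
WHERE num = 1   -- use num = 2 for the second most frequent w_id, and so on
ORDER BY id;
```

For the sample data this yields (5, 8), (6, 10) and (7, 8); the last one is picked only because of the tie-break on w_id.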

Performance

To improve performance, you can create an index on (id, w_id); it can be used for grouping and ordering the records. However, the variables and HAVING will cause a row-by-row scan of the set derived by the inner GROUP BY. That is not as bad as a full scan of the original data, but it is still a drawback of the variable-based approach. On the other hand, doing it with a JOIN and subquery, as in your query, won't be much different, because a temporary table is created for the subquery result set there too.
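A minimal sketch of that index follows. The index name `idx_p_id_w_id` is hypothetical, and EXPLAIN is included only as a way to check whether the optimizer actually picks the index for the grouping step.

```sql
-- Hypothetical index name; covers the columns used by GROUP BY id, w_id.
CREATE INDEX idx_p_id_w_id ON p (id, w_id);

-- Inspect the plan of the grouping subquery to see whether the index is chosen.
EXPLAIN
SELECT id, w_id, COUNT(w_id) AS w_count
FROM p
GROUP BY id, w_id;
```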

But to be certain, you'll have to test. And keep in mind that you already have a valid solution, which, by the way, isn't tied to DBMS-specific features and is good standard SQL.

Alma Do Avatar answered Oct 11 '22 07:10
