Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get n grouped categories and sum others into one

I have a table with the following structure:

Contents (
  id
  name
  desc
  tdate
  categoryid
  ...
)

I need to do some statistics with the data in this table. For example I want to get number of rows with the same category by grouping and id of that category. Also I want to limit them for n rows in descending order and if there are more categories available I want to mark them as "Others". So far I have come out with 2 queries to database:

Select n rows in descending order:

SELECT COALESCE(ca.NAME, 'Unknown') AS label
    ,ca.id AS catid
    ,COUNT(c.id) AS data
FROM contents c
LEFT OUTER JOIN category ca ON ca.id = c.categoryid
GROUP BY label
    ,catid
ORDER BY data DESC LIMIT 7

Select other rows as one:

SELECT 'Others' AS label
    ,COUNT(c.id) AS data
FROM contents c
LEFT OUTER JOIN category ca ON ca.id = c.categoryid
WHERE c.categoryid NOT IN ($INCONDITION)

But when I have no category groups left in db table I still get an "Others" record. Is it possible to make it in one query and make the "Others" record optional?

like image 229
iamawebgeek Avatar asked Apr 10 '15 11:04

iamawebgeek


1 Answers

The specific difficulty here: Queries with one or more aggregate functions in the SELECT list and no GROUP BY clause produce exactly one row, even if no row is found in the underlying table.

There is nothing you can do in the WHERE clause to suppress that row. You have to exclude such a row after the fact, i.e. in the HAVING clause, or in an outer query.

Per documentation:

If a query contains aggregate function calls, but no GROUP BY clause, grouping still occurs: the result is a single group row (or perhaps no rows at all, if the single row is then eliminated by HAVING). The same is true if it contains a HAVING clause, even without any aggregate function calls or GROUP BY clause.

It should be noted that adding a GROUP BY clause with only a constant expression (which is otherwise completely pointless!) works, too. See example below. But I'd rather not use that trick, even if it's short, cheap and simple, because it's hardly obvious what it does.

The following query only needs a single table scan and returns the top 7 categories ordered by count. If (and only if) there are more categories, the rest is summarized into 'Others':

WITH cte AS (
   SELECT categoryid, count(*) AS data
        , row_number() OVER (ORDER BY count(*) DESC, categoryid) AS rn
   FROM   contents
   GROUP  BY 1
   )
(  -- parentheses required again
SELECT categoryid, COALESCE(ca.name, 'Unknown') AS label, data
FROM   cte
LEFT   JOIN category ca ON ca.id = cte.categoryid
WHERE  rn <= 7
ORDER  BY rn
)
UNION ALL
SELECT NULL, 'Others', sum(data)
FROM   cte
WHERE  rn > 7         -- only take the rest
HAVING count(*) > 0;  -- only if there actually is a rest
-- or: HAVING  sum(data) > 0
  • You need to break ties if multiple categories can have the same count across the 7th / 8th rank. In my example, categories with the smaller categoryid win such a race.

  • Parentheses are required to include a LIMIT or ORDER BY clause to an individual leg of a UNION query.

  • You only need to join to table category for the top 7 categories. And it's generally cheaper to aggregate first and join later in this scenario. So don't join in the the base query in the CTE (common table expression) named cte, only join in the first SELECT of the UNION query, that's cheaper.

  • Not sure why you need the COALESCE. If you have a foreign key in place from contents.categoryid to category.id and both contents.categoryid and category.name are defined NOT NULL (like they probably should be), then you don't need it.

The odd GROUP BY true

This would work, too:

...

UNION ALL
SELECT NULL , 'Others', sum(data)
FROM   cte
WHERE  rn > 7
GROUP BY true; 

And I even get slightly faster query plans. But it's a rather odd hack ...

SQL Fiddle demonstrating all.

Related answer with more explanation for the UNION ALL / LIMIT technique:

  • Sum results of a few queries and then find top 5 in SQL
like image 126
Erwin Brandstetter Avatar answered Oct 15 '22 20:10

Erwin Brandstetter