 

How to add columns to a query without having to put them in the GROUP BY?

I have a table to save a process.

Every process is composed of items, and every item has values that are collected during the process.

A process is executed by a client.

Here is the sample database scheme with dummy data: http://sqlfiddle.com/#!15/36af4

There is some info that I need to extract from these tables:

  • The Item ID
  • The item life
  • The total number of processes per item per life
  • The oldest process per item per life (timestamp)
  • The newest process per item per life (timestamp)
  • The max item value per item per life
  • The min item value per item per life

I also need to find the process ID and client ID of the newest and oldest process of an item per life:

  • The process ID of the oldest process per item per life
  • The process ID of the newest process per item per life
  • The client ID of the oldest process per item per life
  • The client ID of the newest process per item per life

Please note that for a specific item in a specific life, the process ID of the oldest process might not match the process ID of that item's minimum value in that life.

I need all this info for every life of every item. An item can have processes under different clients, so I cannot group by client, because that could duplicate the item. The same goes for processes: an item can appear in different processes, so I cannot group by process either.

This is as far as I could get on my own:

SELECT
  PV.ID_ITEM                                                       AS ID_ITEM,
  PV.ITEM_LIFE                                                     AS LIFE,
  COUNT(PV.ID_ITEM)                                                AS TOTAL_ITEM_PROCESS,
  MIN(P.DATE_TIME)                                                 AS OLDEST_PROCESS,
  MAX(P.DATE_TIME)                                                 AS NEWEST_PROCESS,
  MAX(GREATEST(PV.ITEM_VALUE_1, PV.ITEM_VALUE_2, PV.ITEM_VALUE_3)) AS MAX_ITEM_VALUE,
  MIN(LEAST(PV.ITEM_VALUE_1, PV.ITEM_VALUE_2, PV.ITEM_VALUE_3))    AS MIN_ITEM_VALUE
FROM PROCESS P
  JOIN PROCESS_VALUES PV ON P._ID = PV.ID_PROCESS
GROUP BY PV.ID_ITEM, PV.ITEM_LIFE;

But I have no idea how to add the client and process IDs of the oldest and newest processes to this query without also adding those columns to the GROUP BY clause. If I add the client ID to the GROUP BY, for instance, some items will be duplicated whenever they have processes under different clients.

We cannot use MAX or MIN to get the process ID, because the returned ID might not match the collected timestamp: the highest ID does not always have the newest timestamp.
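
For example, in the sample data the higher process ID is actually the older one, which is enough to break that approach:

-- In the sample data, process _ID = 4 ('2018-01-05 13:00:00') is older
-- than _ID = 3 ('2018-01-06 13:00:00'), so these two aggregates disagree:
SELECT MAX(_ID)       AS HIGHEST_ID, -- returns 4
       MAX(DATE_TIME) AS NEWEST_TS   -- belongs to _ID = 3
FROM PROCESS;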

For the data provided in the fiddle, this should be the output:

+---------+-----------+-----------------------------------+----------------------------------+----------------------------------+----------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+
| ITEM ID | ITEM LIFE | TOTAL PROCESSES PER ITEM PER LIFE | OLDEST PROCESS PER ITEM PER LIFE | NEWEST PROCESS PER ITEM PER LIFE | MAX ITEM VALUE PER ITEM PER LIFE | MIN ITEM VALUE PER ITEM PER LIFE | PROCESS ID OF THE OLDEST PROCESS | PROCESS ID OF THE NEWEST PROCESS | CLIENT ID OF THE OLDEST PROCESS | CLIENT ID OF THE NEWEST PROCESS |
+---------+-----------+-----------------------------------+----------------------------------+----------------------------------+----------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+
|     230 |         1 |                                 1 |  '2018-01-01 10:00:00'           |  '2018-01-02 10:00:00'           |                              6.5 |                              1.5 |                                1 |                                2 |                             100 |                             100 |
|     230 |         2 |                                 1 |  '2018-01-01 10:00:00'           |  '2018-01-02 10:00:00'           |                              5.5 |                              2.5 |                                1 |                                2 |                             100 |                             100 |
|     231 |         1 |                                 1 |  '2018-01-01 10:00:00'           |  '2018-01-02 10:00:00'           |                              7.5 |                              1.5 |                                1 |                                2 |                             100 |                             100 |
|     231 |         2 |                                 1 |  '2018-01-01 10:00:00'           |  '2018-01-02 10:00:00'           |                             10.8 |                              4.5 |                                1 |                                2 |                             100 |                             100 |
|     232 |         1 |                                 1 |  '2018-01-01 10:00:00'           |  '2018-01-02 10:00:00'           |                              5.6 |                              0.5 |                                1 |                                2 |                             100 |                             100 |
|     232 |         2 |                                 1 |  '2018-01-01 10:00:00'           |  '2018-01-02 10:00:00'           |                              2.5 |                             25.5 |                                1 |                                2 |                             100 |                             100 |
|     530 |         1 |                                 2 |  '2018-01-05 13:00:00'           |  '2018-01-06 13:00:00'           |                             11.5 |                              1.5 |                                4 |                                3 |                             400 |                             300 |
|     531 |         1 |                                 2 |  '2018-01-05 13:00:00'           |  '2018-01-06 13:00:00'           |                              9.5 |                              1.5 |                                4 |                                3 |                             400 |                             300 |
|     532 |         1 |                                 2 |  '2018-01-05 13:00:00'           |  '2018-01-06 13:00:00'           |                             13.5 |                              4.5 |                                4 |                                3 |                             400 |                             300 |
+---------+-----------+-----------------------------------+----------------------------------+----------------------------------+----------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+

How can I add more columns to the query without having to put them in the GROUP BY?

We use PostgreSQL.

asked Aug 10 '18 by Luiz



1 Answer

It looks like the common top-n-per-group problem with a slight twist.

Here is one way to do it, using the ROW_NUMBER method. Another method is a LATERAL join (sketched below). Which method is faster depends on the data distribution.

You can read through very detailed answers on dba.se: Retrieving n rows per group. (That question is for SQL Server, but Postgres has all the same features, so the answers there apply to Postgres as well.)
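
For illustration, here is a minimal sketch of the LATERAL variant for just the oldest process per item per life (hypothetical, not the query used below; the same pattern with ORDER BY ... DESC would give the newest):

-- For each (item, life) pair, fetch the single oldest process row.
SELECT G.ID_ITEM, G.ITEM_LIFE,
       OLDEST.ID_PROCESS, OLDEST.ID_CLIENT, OLDEST.DATE_TIME
FROM (SELECT DISTINCT ID_ITEM, ITEM_LIFE FROM PROCESS_VALUES) AS G
CROSS JOIN LATERAL
(
    SELECT P._ID AS ID_PROCESS, P.ID_CLIENT, P.DATE_TIME
    FROM PROCESS P
    INNER JOIN PROCESS_VALUES PV ON PV.ID_PROCESS = P._ID
    WHERE PV.ID_ITEM = G.ID_ITEM
      AND PV.ITEM_LIFE = G.ITEM_LIFE
    ORDER BY P.DATE_TIME ASC
    LIMIT 1    -- keep only the oldest row in each group
) AS OLDEST;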

SQL Fiddle

I adjusted your sample data and made all ID_CLIENT values different, to be able to tell them apart in the results.

CREATE TABLE PROCESS (
  _ID BIGINT,
  DATE_TIME TIMESTAMP WITH TIME ZONE,
  ID_CLIENT BIGINT
);

CREATE TABLE PROCESS_VALUES (
  ID_PROCESS BIGINT,
  ID_ITEM BIGINT,
  ITEM_LIFE INTEGER,
  ITEM_VALUE_1 REAL,
  ITEM_VALUE_2 REAL,
  ITEM_VALUE_3 REAL
);

INSERT INTO PROCESS VALUES(1, '2018-01-01 10:00:00', 100);
INSERT INTO PROCESS VALUES(2, '2018-01-02 10:00:00', 200);
INSERT INTO PROCESS VALUES(3, '2018-01-06 13:00:00', 300);
INSERT INTO PROCESS VALUES(4, '2018-01-05 13:00:00', 400);

INSERT INTO PROCESS_VALUES VALUES(1, 230, 1, 5.5, 6.5, 1.5);
INSERT INTO PROCESS_VALUES VALUES(1, 231, 1, 1.5, 7.5, 3.5);
INSERT INTO PROCESS_VALUES VALUES(1, 232, 1, 5.6, 3.5, 0.5);

INSERT INTO PROCESS_VALUES VALUES(2, 230, 2, 5.5, 2.5, 4.5);
INSERT INTO PROCESS_VALUES VALUES(2, 231, 2, 10.8, 6.5, 4.5);
INSERT INTO PROCESS_VALUES VALUES(2, 232, 2, 25.5, 6.5, 2.5);

INSERT INTO PROCESS_VALUES VALUES(3, 530, 1, 1.5, 6.5, 8.5);
INSERT INTO PROCESS_VALUES VALUES(3, 531, 1, 3.5, 6.5, 1.5);
INSERT INTO PROCESS_VALUES VALUES(3, 532, 1, 6.5, 7.0, 4.5);

INSERT INTO PROCESS_VALUES VALUES(4, 530, 1, 1.5, 11.5, 4.5);
INSERT INTO PROCESS_VALUES VALUES(4, 531, 1, 9.5, 8.5, 1.5);
INSERT INTO PROCESS_VALUES VALUES(4, 532, 1, 5.5, 13.5, 4.5);

Query

I wrote the query using CTEs to make it readable. If you want, you can inline them all into one huge query. You should try both and test what runs faster.
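
One rough way to test is to wrap a candidate query in EXPLAIN (ANALYZE) to see actual run times (shown here on a simplified aggregate; the full query below works the same way):

-- EXPLAIN (ANALYZE) runs the query and reports actual timings per node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT PV.ID_ITEM, PV.ITEM_LIFE, COUNT(*)
FROM PROCESS P
INNER JOIN PROCESS_VALUES PV ON P._ID = PV.ID_PROCESS
GROUP BY PV.ID_ITEM, PV.ITEM_LIFE;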

CTE_RN is the two original tables joined together, with two sets of row numbers: one for the oldest processes and one for the newest (that is what ORDER BY DATE_TIME ASC/DESC is for), computed per item, per life (that is what PARTITION BY is for).

CTE_OLDEST keeps only the oldest processes.

CTE_NEWEST keeps only the newest processes.

CTE_GROUPS is the query from your question, which calculates all the summaries.

The final SELECT joins the summaries with the info about the oldest and newest processes.

WITH
CTE_RN
AS
(
    SELECT
        PROCESS.DATE_TIME
        ,PROCESS._ID AS ID_PROCESS
        ,PROCESS.ID_CLIENT
        ,PROCESS_VALUES.ID_ITEM
        ,PROCESS_VALUES.ITEM_LIFE
        ,PROCESS_VALUES.ITEM_VALUE_1
        ,PROCESS_VALUES.ITEM_VALUE_2
        ,PROCESS_VALUES.ITEM_VALUE_3
        ,ROW_NUMBER() OVER (PARTITION BY PROCESS_VALUES.ID_ITEM, PROCESS_VALUES.ITEM_LIFE 
                            ORDER BY PROCESS.DATE_TIME ASC) AS rn1
        ,ROW_NUMBER() OVER (PARTITION BY PROCESS_VALUES.ID_ITEM, PROCESS_VALUES.ITEM_LIFE 
                            ORDER BY PROCESS.DATE_TIME DESC) AS rn2
    FROM
        PROCESS
        INNER JOIN PROCESS_VALUES ON PROCESS._ID = PROCESS_VALUES.ID_PROCESS
)
,CTE_OLDEST
AS
(
    SELECT
        ID_ITEM
        ,ITEM_LIFE
        ,ID_PROCESS
        ,ID_CLIENT
        ,DATE_TIME
    FROM CTE_RN
    WHERE rn1 = 1
)
,CTE_NEWEST
AS
(
    SELECT
        ID_ITEM
        ,ITEM_LIFE
        ,ID_PROCESS
        ,ID_CLIENT
        ,DATE_TIME
    FROM CTE_RN
    WHERE rn2 = 1
)
,CTE_GROUPS
AS
(
  SELECT
      ID_ITEM
      ,ITEM_LIFE
      ,COUNT(ID_ITEM) AS TOTAL_ITEM_PROCESS
      ,MIN(DATE_TIME) AS OLDEST_PROCESS
      ,MAX(DATE_TIME) AS NEWEST_PROCESS
      ,MAX(GREATEST(ITEM_VALUE_1, ITEM_VALUE_2, ITEM_VALUE_3)) AS MAX_ITEM_VALUE
      ,MIN(LEAST(ITEM_VALUE_1, ITEM_VALUE_2, ITEM_VALUE_3)) AS MIN_ITEM_VALUE
  FROM CTE_RN
  GROUP BY
      ID_ITEM, ITEM_LIFE
)
SELECT
    CTE_GROUPS.ID_ITEM
    ,CTE_GROUPS.ITEM_LIFE
    ,CTE_GROUPS.TOTAL_ITEM_PROCESS
    ,CTE_GROUPS.MAX_ITEM_VALUE
    ,CTE_GROUPS.MIN_ITEM_VALUE
    ,CTE_OLDEST.DATE_TIME AS OLDEST_DATE_TIME
    ,CTE_OLDEST.ID_PROCESS AS OLDEST_ID_PROCESS
    ,CTE_OLDEST.ID_CLIENT AS OLDEST_ID_CLIENT
    ,CTE_NEWEST.DATE_TIME AS NEWEST_DATE_TIME
    ,CTE_NEWEST.ID_PROCESS AS NEWEST_ID_PROCESS
    ,CTE_NEWEST.ID_CLIENT AS NEWEST_ID_CLIENT
FROM
    CTE_GROUPS
    INNER JOIN CTE_OLDEST
        ON  CTE_OLDEST.ID_ITEM = CTE_GROUPS.ID_ITEM
        AND CTE_OLDEST.ITEM_LIFE = CTE_GROUPS.ITEM_LIFE
    INNER JOIN CTE_NEWEST
        ON  CTE_NEWEST.ID_ITEM = CTE_GROUPS.ID_ITEM
        AND CTE_NEWEST.ITEM_LIFE = CTE_GROUPS.ITEM_LIFE
ORDER BY
    ID_ITEM, ITEM_LIFE;

Results:

| id_item | item_life | total_item_process | max_item_value | min_item_value |     oldest_date_time | oldest_id_process | oldest_id_client |     newest_date_time | newest_id_process | newest_id_client |
|---------|-----------|--------------------|----------------|----------------|----------------------|-------------------|------------------|----------------------|-------------------|------------------|
|     230 |         1 |                  1 |            6.5 |            1.5 | 2018-01-01T10:00:00Z |                 1 |              100 | 2018-01-01T10:00:00Z |                 1 |              100 |
|     230 |         2 |                  1 |            5.5 |            2.5 | 2018-01-02T10:00:00Z |                 2 |              200 | 2018-01-02T10:00:00Z |                 2 |              200 |
|     231 |         1 |                  1 |            7.5 |            1.5 | 2018-01-01T10:00:00Z |                 1 |              100 | 2018-01-01T10:00:00Z |                 1 |              100 |
|     231 |         2 |                  1 |           10.8 |            4.5 | 2018-01-02T10:00:00Z |                 2 |              200 | 2018-01-02T10:00:00Z |                 2 |              200 |
|     232 |         1 |                  1 |            5.6 |            0.5 | 2018-01-01T10:00:00Z |                 1 |              100 | 2018-01-01T10:00:00Z |                 1 |              100 |
|     232 |         2 |                  1 |           25.5 |            2.5 | 2018-01-02T10:00:00Z |                 2 |              200 | 2018-01-02T10:00:00Z |                 2 |              200 |
|     530 |         1 |                  2 |           11.5 |            1.5 | 2018-01-05T13:00:00Z |                 4 |              400 | 2018-01-06T13:00:00Z |                 3 |              300 |
|     531 |         1 |                  2 |            9.5 |            1.5 | 2018-01-05T13:00:00Z |                 4 |              400 | 2018-01-06T13:00:00Z |                 3 |              300 |
|     532 |         1 |                  2 |           13.5 |            4.5 | 2018-01-05T13:00:00Z |                 4 |              400 | 2018-01-06T13:00:00Z |                 3 |              300 |

The result doesn't match your expected output precisely, but I believe there are typos in the question. You said you want the oldest and newest process "per item per life", and items 230, 231, and 232 have only one process per life, so their oldest and newest processes are the same, as the results of my query show.
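
A quick check against the sample data confirms this (counting rows per group works here because PROCESS_VALUES has one row per process per item per life):

SELECT ID_ITEM, ITEM_LIFE, COUNT(*) AS PROCESSES
FROM PROCESS_VALUES
GROUP BY ID_ITEM, ITEM_LIFE
ORDER BY ID_ITEM, ITEM_LIFE;
-- Items 230, 231, 232 have 1 process per life; 530, 531, 532 have 2.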

answered Sep 21 '22 by Vladimir Baranov