How can I SUM distinct records in a Postgres database where there are duplicate records?

Tags: postgresql

Imagine a table that looks like this:

[Image: table with duplicate data]

The SQL to get this data was just SELECT *. The first column is "row_id", the second is "id" (the order ID), and the third is "total" (the revenue).

I'm not sure why there are duplicate rows in the database, but when I do a SUM(total), it includes the duplicate entries even though the order ID is the same. That makes my numbers larger than if I select distinct(id), total, export the result to Excel, and sum the values manually.

So my question is: how can I SUM over just the distinct order IDs, so that I get the same revenue as if I had exported every distinct order ID row to Excel?

Thanks in advance!

Katie F asked Apr 10 '16 00:04



2 Answers

Easy - just divide by the count:

select id, sum(total) / count(id)
from orders
group by id

See live demo.

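This also handles any level of duplication (e.g., triplicates), as long as every duplicate row of an order carries the same total, because sum divided by count is simply the average total per order.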
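
The query in this answer returns one row per order. To get the single revenue figure the question asks for, the per-order result can be wrapped in an outer SUM. A minimal sketch, assuming the table is named orders as in the answer and that every duplicate row of an order carries the same total; the names per_order, order_total and revenue are made up for illustration:

```sql
-- Hypothetical names: per_order, order_total, revenue.
-- Assumes every duplicate row of an order carries the same total.
select sum(order_total) as revenue
from (
  -- one row per order: sum / count is the group's average total
  select id, sum(total) / count(id) as order_total
  from orders
  group by id
) per_order;
```

If the duplicates of an order could carry different totals, the average would no longer match any single row, so this approach is probably only safe for exact duplicates.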

Bohemian answered Sep 24 '22 00:09


You can try something like this (with your example):

Table

create table test (
  row_id int,
  id int,
  total decimal(15,2)
);

insert into test values 
(6395, 1509, 112), (22986, 1509, 112), 
(1393, 3284, 40.37), (24360, 3284, 40.37);

Query

with distinct_records as (
  select distinct id, total from test
)

select a.id, b.actual_total, array_agg(a.row_id) as row_ids
from test a
inner join (select id, sum(total) as actual_total from distinct_records group by id) b
  on a.id = b.id
group by a.id, b.actual_total

Result

|   id | actual_total |    row_ids |
|------|--------------|------------|
| 1509 |          112 | 6395,22986 |
| 3284 |        40.37 | 1393,24360 |

Explanation

We do not know why orders and totals appear more than once with different row_id values. So, using a common table expression (CTE) introduced by the with ... clause, we first get the distinct id and total pairs.

After the CTE, we use this distinct data to do the totaling: we join id in the original table with the aggregation over the distinct values. Then we collect the row_ids of each order with array_agg (shown comma-separated in the result) so that the information looks cleaner.

SQLFiddle example

http://sqlfiddle.com/#!15/72639/3
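
Building on the query in this answer, which reports one row per order, the single revenue figure from the question could be computed as sketched below. This assumes the test table and sample rows defined in this answer; the subquery aliases and the revenue column name are illustrative and not part of the original answer:

```sql
-- Option 1: collapse exact (id, total) duplicates, then add up what is left.
-- With the four sample rows this gives 112.00 + 40.37 = 152.37.
select sum(total) as revenue
from (
  select distinct id, total
  from test
) distinct_orders;

-- Option 2: keep exactly one row per order id (here, the lowest row_id),
-- using PostgreSQL's DISTINCT ON. This also copes with duplicates whose
-- totals differ, at the cost of picking one of them arbitrarily by row_id.
select sum(total) as revenue
from (
  select distinct on (id) id, total
  from test
  order by id, row_id
) one_row_per_order;
```

Option 1 would count an order twice if its duplicate rows had different totals; Option 2 avoids that, but whether the lowest row_id is the right row to keep is an assumption that depends on how the duplicates were created.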

zedfoxus answered Sep 22 '22 00:09