There are lots of data coming from multiple sources that I need to group based on priority, but the data quality from those sources is different - they may be missing some data. The task is to group that data into a separate table, in as complete as possible way.
For example:
create table grouped_data (
id serial primary key,
type text,
a text,
b text,
c int
);
create table raw_data (
id serial primary key,
type text,
a text,
b text,
c int,
priority int
);
insert into raw_data
(type, a, b, c, priority)
values
('one', null, '', 123, 1),
('one', 'foo', '', 456, 2),
('one', 'bar', 'baz', 789, 3),
('two', null, 'two-b', 11, 3),
('two', '', '', 33, 2),
('two', null, 'two-bbb', 22, 1);
Now I need to group records by type
, order by priority
, take the first non-null and non-empty value, and put it into grouped_data
.
In this case, value of a
for group one
would be foo
because the row that holds that value have a higher priority than the one with bar
. And c
should be 123
, as it has the highest prio.
Same for group two
, for each column we take the data that is non-null, non-empty, and has the highest priority, or fallback to null
if no actual data present.
In the end, grouped_data
is expected to have the following content:
('one', 'foo', 'baz', 123),
('two', null, 'two-bbb', 22)
I've tried grouping, sub-selects, MERGE, cross joins... Alas, my knowledge of PostgreSQL is not good enough to get it working. One thing I'd like to avoid, too - is going through columns one-by-one, since in the real world there are few dozens of columns to work with...
A link to a fiddle I've been using to mess around with this: http://sqlfiddle.com/#!17/76699/1
UPD:
Thank you all! Oleksii Tambovtsev's solution is the fastest one. On a set of data closely resembling a real-world case (2m records, ~30 fields) it takes only 20 seconds to produce the exact same set of data, which was previously generated programmatically and took over 20 minutes.
eshirvana's solution does the same in 95s, Steve Kass' in 125s, and Stefanov.sm - 308s (which is still helluvalotfaster than programatically!)
Thank you all :)
You should try this:
SELECT
type,
(array_agg(a ORDER BY priority ASC) FILTER (WHERE a IS NOT NULL AND a != ''))[1] as a,
(array_agg(b ORDER BY priority ASC) FILTER (WHERE b IS NOT NULL AND b != ''))[1] as b,
(array_agg(c ORDER BY priority ASC) FILTER (WHERE c IS NOT NULL))[1] as c
FROM raw_data GROUP BY type ORDER BY type;
you can use window function first_value
:
select distinct
type
, first_value(a) over (partition by type order by nullif(a,'') is null, priority) as a
, first_value(b) over (partition by type order by nullif(b,'') is null, priority) as b
, first_value(c) over (partition by type order by priority) as c
from raw_data
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With