Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgres - select non-blank non-null values from multiple ordered rows

There are lots of data coming from multiple sources that I need to group based on priority, but the data quality from those sources is different - they may be missing some data. The task is to group that data into a separate table, in as complete as possible way.

For example:

create table grouped_data (
  id serial primary key,
  type text,
  a text,
  b text,
  c int
);

create table raw_data (
  id serial primary key,
  type text,
  a text,
  b text,
  c int,
  priority int
);


insert into raw_data
(type, a,       b,         c,   priority)
values
('one', null,    '',        123, 1),
('one', 'foo',   '',        456, 2),
('one', 'bar',   'baz',     789, 3),
('two', null,    'two-b',   11,  3),
('two', '',      '',        33,  2),
('two', null,    'two-bbb', 22,  1);

Now I need to group records by type, order by priority, take the first non-null and non-empty value, and put it into grouped_data. In this case, value of a for group one would be foo because the row that holds that value have a higher priority than the one with bar. And c should be 123, as it has the highest prio. Same for group two, for each column we take the data that is non-null, non-empty, and has the highest priority, or fallback to null if no actual data present.

In the end, grouped_data is expected to have the following content:

('one', 'foo', 'baz',     123),
('two', null,  'two-bbb', 22)

I've tried grouping, sub-selects, MERGE, cross joins... Alas, my knowledge of PostgreSQL is not good enough to get it working. One thing I'd like to avoid, too - is going through columns one-by-one, since in the real world there are few dozens of columns to work with...

A link to a fiddle I've been using to mess around with this: http://sqlfiddle.com/#!17/76699/1


UPD:

Thank you all! Oleksii Tambovtsev's solution is the fastest one. On a set of data closely resembling a real-world case (2m records, ~30 fields) it takes only 20 seconds to produce the exact same set of data, which was previously generated programmatically and took over 20 minutes.

eshirvana's solution does the same in 95s, Steve Kass' in 125s, and Stefanov.sm - 308s (which is still helluvalotfaster than programatically!)

Thank you all :)

like image 713
Alfuken Avatar asked Dec 22 '21 18:12

Alfuken


2 Answers

You should try this:

SELECT
       type,
       (array_agg(a ORDER BY priority ASC) FILTER (WHERE a IS NOT NULL AND a != ''))[1] as a,
       (array_agg(b ORDER BY priority ASC) FILTER (WHERE b IS NOT NULL AND b != ''))[1] as b,
       (array_agg(c ORDER BY priority ASC) FILTER (WHERE c IS NOT NULL))[1] as c
FROM raw_data GROUP BY type ORDER BY type;
like image 169
Oleksii Tambovtsev Avatar answered Nov 15 '22 07:11

Oleksii Tambovtsev


you can use window function first_value:

select distinct 
    type 
  , first_value(a) over (partition by type order by nullif(a,'') is null, priority) as a
  , first_value(b) over (partition by type order by nullif(b,'') is null, priority)  as b
  , first_value(c) over (partition by type order by priority) as c
from raw_data 
like image 25
eshirvana Avatar answered Nov 15 '22 07:11

eshirvana