Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Taking N-samples from each group in PostgreSQL

I have a table containing data that has a column named id that looks like below:

id value 1 value 2 value 3
1 244 550 1000
1 251 551 700
1 540 60 1200
... ... ... ...
2 19 744 2000
2 10 903 100
2 44 231 600
2 120 910 1100
... ... ... ...

I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points.

For example I would like a maximum 50 data points randomly selected from id = 1, id = 2 etc...

I cannot find any previous questions similar to this but have tried taking a stab at at least logically working through the solution where I could iterate and union all queries by id and limit to 50:

SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;

But it's obvious that you cannot use this type of solution because UNION ALL requires aggregating outputs from one id to the next and I do not have a list of id values to use in place of X in tbl.id = X.

Is there a way to accomplish this by gathering that list of unique id values and union all results or is there a more optimal way this could be done?

like image 222
HelloToEarth Avatar asked Sep 12 '25 14:09

HelloToEarth


1 Answers

If you want to select a random sample for each id, then you need to randomize the rows somehow. Here is a way to do it:

select * from (
  select *, row_number() over (partition by id order by random()) as u
  from schema.table
) as a
where u <= 50;

Example (limiting to 3, and some row number for each id so you can see the selection randomness):

  1. setup
DROP TABLE IF EXISTS foo;

CREATE TABLE foo
(
  id int,
  value1 int,
  idrow int
  );

INSERT INTO foo
select 1 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 2 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 3 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow;
  1. Selection
select * from (
  select *, row_number() over (partition by id order by random()) as u
  from foo
) as a
where u <= 3;

Output:

id value1 idrow u
1 542 6 1
1 24 86 2
1 155 74 3
2 505 95 1
2 100 46 2
2 422 33 3
3 966 88 1
3 747 89 2
3 664 19 3
like image 181
Pierre D Avatar answered Sep 14 '25 23:09

Pierre D