Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL random sample with groups

I have a university graduate database and would like to extract a random sample of data of around 1000 records.

I want to ensure the sample is representative of the population so would like to include the same proportions of courses eg

enter image description here

I could do this using the following:

select top 500 id from degree where coursecode = 1 order by newid()
union
select top 300 id from degree where coursecode = 2 order by newid()
union
select top 200 id from degree where coursecode = 3 order by newid()

but we have hundreds of courses codes so this would be time consuming and I would like to be able to reuse this code for different sample sizes and don't particularly want to go through the query and hard code the sample sizes.

Any help would be greatly appreciated

like image 504
Simon Avatar asked May 14 '15 10:05

Simon


People also ask

How do I randomly sample in SQL?

To get a single row randomly, we can use the LIMIT Clause and set to only one row. ORDER BY clause in the query is used to order the row(s) randomly. It is exactly the same as MYSQL. Just replace RAND( ) with RANDOM( ).

How do I select random 100 rows in SQL?

The function RAND() generates a random value for each row in the table. The ORDER BY clause sorts all rows in the table by the random number generated by the RAND() function. The LIMIT clause picks the first row in the result set sorted randomly.

How do I select a random row by group in SQL?

Below SQL statement is to display the defined number of random rows from a table using RAND() function: Query: SELECT * FROM table_name order by RANDOM() LIMIT n; In table_name mention your Table Name and in the place of 'n' give how many rows to be fetched.


2 Answers

Add a table for storing population.

I think it should be like this:

SELECT *
FROM (
    SELECT id, coursecode, ROW_NUMBER() OVER (PARTITION BY coursecode ORDER BY NEWID()) AS rn
    FROM degree) t
    LEFT OUTER JOIN
    population p ON t.coursecode = p.coursecode
WHERE
    rn <= p.SampleSize
like image 22
shA.t Avatar answered Sep 18 '22 18:09

shA.t


You want a stratified sample. I would recommend doing this by sorting the data by course code and doing an nth sample. Here is one method that works best if you have a large population size:

select d.*
from (select d.*,
             row_number() over (order by coursecode, newid) as seqnum,
             count(*) over () as cnt
      from degree d
     ) d
where seqnum % (cnt / 500) = 1;

EDIT:

You can also calculate the population size for each group "on the fly":

select d.*
from (select d.*,
             row_number() over (partition by coursecode order by newid) as seqnum,
             count(*) over () as cnt,
             count(*) over (partition by coursecode) as cc_cnt
      from degree d
     ) d
where seqnum < 500 * (cc_cnt * 1.0 / cnt)
like image 134
Gordon Linoff Avatar answered Sep 20 '22 18:09

Gordon Linoff