I have a university graduate database and would like to extract a random sample of data of around 1000 records.
I want to ensure the sample is representative of the population so would like to include the same proportions of courses eg
I could do this using the following:
select top 500 id from degree where coursecode = 1 order by newid()
union
select top 300 id from degree where coursecode = 2 order by newid()
union
select top 200 id from degree where coursecode = 3 order by newid()
but we have hundreds of courses codes so this would be time consuming and I would like to be able to reuse this code for different sample sizes and don't particularly want to go through the query and hard code the sample sizes.
Any help would be greatly appreciated
To get a single row randomly, we can use the LIMIT Clause and set to only one row. ORDER BY clause in the query is used to order the row(s) randomly. It is exactly the same as MYSQL. Just replace RAND( ) with RANDOM( ).
The function RAND() generates a random value for each row in the table. The ORDER BY clause sorts all rows in the table by the random number generated by the RAND() function. The LIMIT clause picks the first row in the result set sorted randomly.
Below SQL statement is to display the defined number of random rows from a table using RAND() function: Query: SELECT * FROM table_name order by RANDOM() LIMIT n; In table_name mention your Table Name and in the place of 'n' give how many rows to be fetched.
Add a table for storing population
.
I think it should be like this:
SELECT *
FROM (
SELECT id, coursecode, ROW_NUMBER() OVER (PARTITION BY coursecode ORDER BY NEWID()) AS rn
FROM degree) t
LEFT OUTER JOIN
population p ON t.coursecode = p.coursecode
WHERE
rn <= p.SampleSize
You want a stratified sample. I would recommend doing this by sorting the data by course code and doing an nth sample. Here is one method that works best if you have a large population size:
select d.*
from (select d.*,
row_number() over (order by coursecode, newid) as seqnum,
count(*) over () as cnt
from degree d
) d
where seqnum % (cnt / 500) = 1;
EDIT:
You can also calculate the population size for each group "on the fly":
select d.*
from (select d.*,
row_number() over (partition by coursecode order by newid) as seqnum,
count(*) over () as cnt,
count(*) over (partition by coursecode) as cc_cnt
from degree d
) d
where seqnum < 500 * (cc_cnt * 1.0 / cnt)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With