SQL random sample with groups

I have a university graduate database and would like to extract a random sample of data of around 1000 records.

I want the sample to be representative of the population, so each course should appear in the sample in the same proportion as in the full table, e.g.:

[image: example course proportions]

I could do this using the following:

select id from (select top 500 id from degree where coursecode = 1 order by newid()) c1
union all
select id from (select top 300 id from degree where coursecode = 2 order by newid()) c2
union all
select id from (select top 200 id from degree where coursecode = 3 order by newid()) c3

However, we have hundreds of course codes, so writing this by hand would be time consuming. I would also like to reuse the code for different sample sizes without going through the query and hard-coding each group's sample size.

Any help would be greatly appreciated.

asked May 14 '15 10:05

Simon

2 Answers

Add a table (called population here) that stores the sample size you want for each course code, in a SampleSize column.

I think the query should then be like this:

SELECT *
FROM (
    SELECT id, coursecode, ROW_NUMBER() OVER (PARTITION BY coursecode ORDER BY NEWID()) AS rn
    FROM degree) t
    -- an inner join is enough: the WHERE filter already drops courses with no population row
    INNER JOIN
    population p ON t.coursecode = p.coursecode
WHERE
    t.rn <= p.SampleSize
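
The population table itself does not have to be filled in by hand. As a rough sketch (not part of the answer above), it could be derived from degree so that each course's SampleSize is proportional to its share of the rows. The variable @SampleSize and the use of SELECT ... INTO are assumptions for illustration, and the inline DECLARE initialisation needs SQL Server 2008 or later:

```sql
-- Assumed target size for the whole sample; change it to reuse the script.
DECLARE @SampleSize int = 1000;

-- Build the per-course sample sizes in proportion to each course's row count.
-- SUM(COUNT(*)) OVER () is the total number of rows across all courses.
SELECT coursecode,
       CAST(ROUND(@SampleSize * COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (), 0) AS int) AS SampleSize
INTO population
FROM degree
GROUP BY coursecode;
```

Because each group is rounded separately, the sizes may not add up to exactly @SampleSize, and a course with a very small share can round down to 0.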

answered Sep 18 '22 18:09

shA.t

You want a stratified sample. I would recommend doing this by sorting the data by course code and taking every nth row (a systematic sample). Here is one method that works best if you have a large population size:

select d.*
from (select d.*,
             row_number() over (order by coursecode, newid()) as seqnum,
             count(*) over () as cnt
      from degree d
     ) d
-- keep every (cnt / 500)th row; integer division, so this needs cnt >= 1000
where seqnum % (cnt / 500) = 1;

EDIT:

You can also calculate the population size for each group "on the fly":

select d.*
from (select d.*,
             row_number() over (partition by coursecode order by newid()) as seqnum,
             count(*) over () as cnt,
             count(*) over (partition by coursecode) as cc_cnt
      from degree d
     ) d
-- <= keeps floor(500 * share) rows per course; < would drop one row from most groups
where seqnum <= 500 * (cc_cnt * 1.0 / cnt)
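
To reuse the "on the fly" query for different sample sizes, the hard-coded 500 could be moved into a variable. This is only a sketch under that assumption: @SampleSize is a name introduced here, and ROUND is used so that each course gets the nearest whole number of rows rather than always rounding down:

```sql
-- Assumed target size for the whole sample (inline initialisation needs SQL Server 2008+).
DECLARE @SampleSize int = 1000;

SELECT s.id, s.coursecode
FROM (SELECT d.id, d.coursecode,
             -- random order within each course
             ROW_NUMBER() OVER (PARTITION BY d.coursecode ORDER BY NEWID()) AS seqnum,
             -- total rows, and rows in this course
             COUNT(*) OVER () AS cnt,
             COUNT(*) OVER (PARTITION BY d.coursecode) AS cc_cnt
      FROM degree d
     ) s
-- keep each course's proportional share of @SampleSize, rounded to the nearest row
WHERE s.seqnum <= ROUND(@SampleSize * (s.cc_cnt * 1.0 / s.cnt), 0);
```

The total returned will be close to @SampleSize but not necessarily exact, since every course is rounded on its own.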

answered Sep 20 '22 18:09

Gordon Linoff

Tags: sql, sql-server, sample, random-sample
