I have a table of 58 million customer records. Each customer has a market value (EN, US, FR etc.) I'm trying to select a 100k sample set which contains customers from every market. The ratio of customers per market in the sample must match the ratios in the actual table. So if UK customers account for 15% of the records in the customer table then there must be 15k UK customers in the 100k sample set and the same then for each market. Is there a way to do this?


Select n amount of random rows where n is proportionate to each value's % of total population

2 Answers

First, a simple random sample should do pretty well on representing the market sizes. What you are asking for is a stratified sample.
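As a rough illustration of the simple random sample mentioned here, one common SQL Server idiom is to order by newid() and keep the top rows. This is only a sketch: it assumes the table is named customers, and on 58 million rows the full sort may be slow.

```sql
-- Simple random sample: each row is ordered by a freshly generated GUID
-- and the first 100,000 rows are kept. Market shares are preserved only
-- approximately, not exactly.
select top (100000) c.*
from customers c
order by newid();
```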

One way to get such a sample is to order the data randomly and assign a sequential number in each group. Then normalize the sequential number to be between 0 and 1, and finally order by the normalized value and choose the top "n" rows:

select top 100000 c.*
from (select c.*,
             row_number() over (partition by market order by rand(checksum(newid()))
                               ) as seqnum,
             count(*) over (partition by market) as cnt
      from customers c
     ) c
order by cast(seqnum as float) / cnt

It may be clearer what is happening if you look at the data. Consider taking a sample of 5 from:

1    A
2    B
3    C
4    D
5    D
6    D
7    B
8    A
9    D
10   C

The first step assigns a sequential number randomly within each market:

1    A      1
2    B      1
3    C      1
4    D      1
5    D      2
6    D      3
7    B      2
8    A      2   
9    D      4
10   C      2

Next, normalize these values:

1    A      1      0.50
2    B      1      0.50
3    C      1      0.50
4    D      1      0.25
5    D      2      0.50
6    D      3      0.75
7    B      2      1.00
8    A      2      1.00
9    D      4      1.00
10   C      2      1.00

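Now, if you order by the normalized value and take the top 5, you get rows 1 through 5: one customer each from A, B, and C and two from D. That matches the 20/20/20/40 split of the full data, so it is a stratified sample.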
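The walk-through above can be reproduced on a small scale. The following is a minimal sketch, not part of the original answer: the table variable @customers and the names s and sampled are made up for illustration, and the ten rows are the example data.

```sql
-- Hypothetical demo data: the ten rows from the worked example.
declare @customers table (id int primary key, market char(1));

insert into @customers (id, market)
values (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'D'),
       (6, 'D'), (7, 'B'), (8, 'A'), (9, 'D'), (10, 'C');

-- Same query shape as the answer, with top 5 instead of top 100000,
-- then counted per market to check the proportions.
with s as (
    select top (5) c.id, c.market
    from (select c.*,
                 row_number() over (partition by market
                                    order by rand(checksum(newid()))) as seqnum,
                 count(*) over (partition by market) as cnt
          from @customers c
         ) c
    order by cast(seqnum as float) / cnt
)
select market, count(*) as sampled
from s
group by market
order by market;
```

Which rows are picked changes from run to run, but the counts per market should not: the fifth and sixth normalized values are 0.50 and 0.75, so the cut never falls inside a tie, and the result is one row each for A, B, and C and two for D.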


answered Nov 03 '22 01:11

Gordon Linoff

With a sample that big, a plain random extraction will give you a good statistical approximation of the original population, as pointed out by Gordon Linoff.

To force the sample to have exactly the same percentages as the population, you can calculate the parameters you need (the size of the whole population and the size of each partition) and combine them with a random ID.

Declare @sampleSize INT;
Set @sampleSize = 100000;  -- the statement before a CTE must end with ";"

With D AS (
  SELECT customerID
       , Country
       , Count(customerID) OVER (PARTITION BY Null) TotalData      -- rows in the whole table
       , Count(customerID) OVER (PARTITION BY Country) CountryData -- rows in this country
       , Row_Number() OVER (PARTITION BY Country 
                            ORDER BY rand(checksum(newid()))) ID   -- random rank within the country
  FROM   customer
)
SELECT customerID
     , Country
FROM   D
-- keep each country's share of @sampleSize, rounded to whole rows
WHERE  ID <= Round((Cast(CountryData as Float) / TotalData) * @sampleSize, 0)
ORDER BY Country;

SQLFiddle demo with less data.

Be aware that the rounding in the WHERE condition can make the number of returned rows slightly lower or higher than the requested sample size. For example, the demo returns 9 rows instead of 10.
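To see in advance how far the rounding moves the total, one option is to compute each country's quota on its own. This is a sketch that assumes the same customer table and Country column as the query above; the names q, quota, and totalRows are made up for illustration.

```sql
declare @sampleSize int = 100000;

-- Rows each country contributes after rounding, plus the resulting total.
with q as (
    select Country,
           round(cast(count(*) as float) / sum(count(*)) over () * @sampleSize, 0) as quota
    from customer
    group by Country
)
select Country,
       quota,
       sum(quota) over () as totalRows  -- compare this with @sampleSize
from q
order by Country;
```

If totalRows differs from @sampleSize and an exact row count matters more than exact per-country quotas, the top-n query from the first answer is the alternative, since top fixes the number of rows returned.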

answered Nov 03 '22 01:11

Serpiton


Tags:

sql

sql-server

random

user3687444
