I have a table of 58 million customer records. Each customer has a market value (EN, US, FR etc.)
I'm trying to select a 100k sample set which contains customers from every market. The ratio of customers per market in the sample must match the ratios in the actual table.
So if UK customers account for 15% of the records in the customer table then there must be 15k UK customers in the 100k sample set and the same then for each market.
Is there a way to do this?
This can be done in one of two ways: the lottery or random number method. In the lottery method, you choose the sample at random by “drawing from a hat” or by using a computer program that will simulate the same action. In the random number method, you assign every individual a number.
A simple random sample is used to represent the entire data population and randomly selects individuals from the population without any other consideration. A stratified random sample, on the other hand, first divides the population into smaller groups, or strata, based on shared characteristics.
Sample_n() function is used to select n random rows from a dataframe in R.
A simple random sampling is one in which every item of the population has an equal chance of being selected. This method is also known as unrestricted random sampling.
First, a simple random sample should do pretty well on representing the market sizes. What you are asking for is a stratified sample.
One way to get such a sample is to order the data randomly and assign a sequential number in each group. Then normalize the sequential number to be between 0 and 1, and finally order by the normalized value and choose the top "n" rows:
select top 100000 c.*
from (select c.*,
row_number() over (partition by market order by rand(checksum(newid()))
) as seqnum,
count(*) over (partition by market) as cnt
from customers c
) c
order by cast(seqnum as float) / cnt
It may be clear what is happening if you look at the data. Consider taking a sample of 5 from:
1 A
2 B
3 C
4 D
5 D
6 D
7 B
8 A
9 D
10 C
The first step assigns a sequential number randomly within each market:
1 A 1
2 B 1
3 C 1
4 D 1
5 D 2
6 D 3
7 B 2
8 A 2
9 D 4
10 C 2
Next, normalize these values:
1 A 1 0.50
2 B 1 0.50
3 C 1 0.50
4 D 1 0.25
5 D 2 0.50
6 D 3 0.75
7 B 2 1.00
8 A 2 1.00
9 D 4 1.00
10 C 2 1.00
Now, if you take the top 5, you will get the first five values which is a stratified sample.
Using a sample that big a casual extraction will give you a sample with good statitical approximation of the original population, as pointed out by Gordon Linoff.
To force the equal percentage between the population and the sample you can calculate and use all the needed parameter: the dimension of the population and the dimension of the partition, with the addition of a random ID.
Declare @sampleSize INT
Set @sampleSize = 100000
With D AS (
SELECT customerID
, Country
, Count(customerID) OVER (PARTITION BY Null) TotalData
, Count(customerID) OVER (PARTITION BY Country) CountryData
, Row_Number() OVER (PARTITION BY Country
ORDER BY rand(checksum(newid()))) ID
FROM customer
)
SELECT customerID
, Country
FROM D
WHERE ID <= Round((Cast(CountryData as Float) / TotalData) * @sampleSize, 0)
ORDER BY Country
SQLFiddle demo with less data.
Be aware that the approximation of the function in the WHERE
condition can make the returned data a little less or a little more of the desired one, for example in the demo the rows returned are 9 instead of 10.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With