Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select n amount of random rows where n is proportionate to each value's % of total population

I have a table of 58 million customer records. Each customer has a market value (EN, US, FR etc.)

I'm trying to select a 100k sample set which contains customers from every market. The ratio of customers per market in the sample must match the ratios in the actual table.

So if UK customers account for 15% of the records in the customer table then there must be 15k UK customers in the 100k sample set and the same then for each market.

Is there a way to do this?

like image 998
user3687444 Avatar asked May 29 '14 12:05

user3687444


People also ask

Which is a technique of selecting random sampling--?

This can be done in one of two ways: the lottery or random number method. In the lottery method, you choose the sample at random by “drawing from a hat” or by using a computer program that will simulate the same action. In the random number method, you assign every individual a number.

What is the difference between random sampling and stratified sampling?

A simple random sample is used to represent the entire data population and randomly selects individuals from the population without any other consideration. A stratified random sample, on the other hand, first divides the population into smaller groups, or strata, based on shared characteristics.

How do I select a random row in R?

Sample_n() function is used to select n random rows from a dataframe in R.

Which of the following is the method of unrestricted random sampling?

A simple random sampling is one in which every item of the population has an equal chance of being selected. This method is also known as unrestricted random sampling.


2 Answers

First, a simple random sample should do pretty well on representing the market sizes. What you are asking for is a stratified sample.

One way to get such a sample is to order the data randomly and assign a sequential number in each group. Then normalize the sequential number to be between 0 and 1, and finally order by the normalized value and choose the top "n" rows:

select top 100000 c.*
from (select c.*,
             row_number() over (partition by market order by rand(checksum(newid()))
                               ) as seqnum,
             count(*) over (partition by market) as cnt
      from customers c
     ) c
order by cast(seqnum as float) / cnt

It may be clear what is happening if you look at the data. Consider taking a sample of 5 from:

1    A
2    B
3    C
4    D
5    D
6    D
7    B
8    A
9    D
10   C

The first step assigns a sequential number randomly within each market:

1    A      1
2    B      1
3    C      1
4    D      1
5    D      2
6    D      3
7    B      2
8    A      2   
9    D      4
10   C      2

Next, normalize these values:

1    A      1      0.50
2    B      1      0.50
3    C      1      0.50
4    D      1      0.25
5    D      2      0.50
6    D      3      0.75
7    B      2      1.00
8    A      2      1.00
9    D      4      1.00
10   C      2      1.00

Now, if you take the top 5, you will get the first five values which is a stratified sample.

like image 163
Gordon Linoff Avatar answered Nov 03 '22 01:11

Gordon Linoff


Using a sample that big a casual extraction will give you a sample with good statitical approximation of the original population, as pointed out by Gordon Linoff.

To force the equal percentage between the population and the sample you can calculate and use all the needed parameter: the dimension of the population and the dimension of the partition, with the addition of a random ID.

Declare @sampleSize INT
Set @sampleSize = 100000

With D AS (
  SELECT customerID
       , Country
       , Count(customerID) OVER (PARTITION BY Null) TotalData
       , Count(customerID) OVER (PARTITION BY Country) CountryData
       , Row_Number() OVER (PARTITION BY Country 
                            ORDER BY rand(checksum(newid()))) ID
  FROM   customer
)
SELECT customerID
     , Country
FROM   D
WHERE  ID <= Round((Cast(CountryData as Float) / TotalData) * @sampleSize, 0)
ORDER BY Country

SQLFiddle demo with less data.

Be aware that the approximation of the function in the WHERE condition can make the returned data a little less or a little more of the desired one, for example in the demo the rows returned are 9 instead of 10.

like image 22
Serpiton Avatar answered Nov 03 '22 01:11

Serpiton