Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL - 5% random sample by group

I have a table with about 10 million rows and 4 columns, no primary key. Data in Column 2 3 4 (x2 x3 and x4) are grouped by 50 groups identified in column1 X1.

To get a random sample of 5% from table, I have always used

SELECT TOP 5 PERCENT *
FROM thistable
ORDER BY NEWID()

The result returns about 500,000 rows. But, some groups get an unequal representation in the sample (relative to their original size) if sampled this way.

This time, to get a better sample, I wanted to get 5% sample from each of the 50 groups identified in column X1. So, at the end, I can get a random sample of 5% of rows in each of the 50 groups in X1 (instead of 5% of entire table).

How can I approach this problem? Thank you.

like image 378
Siftekhar Avatar asked Nov 18 '13 19:11

Siftekhar


People also ask

How do you find the top 10% in SQL?

Example - Using TOP PERCENT keywordSELECT TOP(10) PERCENT contact_id, last_name, first_name FROM contacts WHERE last_name = 'Anderson' ORDER BY contact_id; This SQL SELECT TOP example would select the first 10% of the records from the full result set.


1 Answers

You need to be able to count each group and then coerce the data out in a random order. Fortuantly, we can do this with a CTE-style query. Although CTE isn't strictly needed it will help break down the solution into little bits, rather than a lots of sub-selects and the like.

I assume you've already got a column that groups the data, and that the value in this column is the same for all items in the group. If so, something like this might work (columns and table names to be changed to suit your situation):

WITH randomID AS (
    -- First assign a random ID to all rows. This will give us a random order.
    SELECT *, NEWID() as random FROM sourceTable
),
countGroups AS (
    -- Now we add row numbers for each group. So each group will start at 1. We order 
    -- by the random column we generated in the previous expression, so you should get
    -- different results in each execution
    SELECT *, ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY random) AS rowcnt FROM randomID
)
-- Now we get the data
SELECT * 
    FROM countGroups c1
    WHERE rowcnt <= (
        SELECT MAX(rowcnt) / 20 FROM countGroups c2 WHERE c1.groupcolumn = c2.groupcolumn
    )

The two CTE expressions allow you to randomly order and then count each group. The final select should then be fairly straightforward: for each group, find out how many rows there are in it, and only return 5% of them (total_row_count_in_group / 20).

like image 75
Chris J Avatar answered Sep 20 '22 15:09

Chris J