We have had a request to provide some data to an external company. They require only a sample of data, simple right? wrong.
Here is their sampling criteria:
Total Number of records divided by 720 (required sample size) - this gives sampling interval (if result is a fraction, round down to next whole number).
Halve the sampling interval to get the starting point.
- Return each record by adding on the sampling interval.
EXAMPLE:
- 10,000 Records - Sampling interval = 13 (10,000/720)
- Starting Point = 6 (13/2 Rounded)
- Return records 6, 19 (6+13), 32 (19+13), 45 (32+13) etc.....
Please can someone tell me how (if) something like this is possible in SQL.
If you have use of ROW_NUMBER(), then you can do this relatively easily.
SELECT
*
FROM
(
SELECT
ROW_NUMBER() OVER (ORDER BY a, b, c, d) AS record_id,
*
FROM
yourTable
)
AS data
WHERE
(record_id + 360) % 720 = 0
ROW_NUMBER()
gives all your data a sequential identifier (this is important as the id field must both be unique and NOT have ANY gaps). It also defines the order you want the data in (ORDER BY a, b, c, d)
.
With that id, if you use Modulo (Often the %
operator), you can test if the record is the 720th record, 1440th record, etc (because 720 % 720 = 0).
Then, if you offset your id value by 360, you can change the starting point of your result set.
EDIT
After re-reading the question, I see you don't want every 720th record, but uniformly selected 720 records.
As such, replace 720
with (SELECT COUNT(*) / 720 FROM yourTable)
And replace 360
with (SELECT (COUNT(*) / 720) / 2 FROM yourTable)
EDIT
Ignoring the rounding conditions will allow a result of exactly 720 records. This requires using non-integer values, and the result of the modulo being less than 1.
WHERE
(record_id + (SELECT COUNT(*) FROM yourTable) / 1440.0)
%
((SELECT COUNT(*) FROM yourTable) / 720.0)
<
1.0
declare @sample_size int, @starting_point int
select @sample_size = 200
select top (@sample_size) col1, col2, col3, col4
from (
select *, row_number() over (order by col1, col2) as row
from your_table
) t
where (row % ((select count(*) from your_table) / @sample_size)) - (select count(*) from your_table) / @sample_size / 2) = 0
It's going to work in SQL Server 2005+.
TOP (@variable) is used to limit rows (where
condition because of integers rounding might not be enough, may return more rows then needed) and ROW_NUMBER() to number and order rows.
Working example: https://data.stackexchange.com/stackoverflow/query/62315/sql-data-sampling below code:
declare @tab table (id int identity(1,1), col1 varchar(3), col2 varchar(3))
declare @i int
set @i = 0
while @i <= 1000
begin
insert into @tab
select 'aaa', 'bbb'
set @i = @i+1
end
declare @sample_size int
select @sample_size = 123
select ((select count(*) from @tab) / @sample_size) as sample_interval
select top (@sample_size) *
from (
select *, row_number() over (order by col1, col2, id desc) as row
from @tab
) t
where (row % ((select count(*) from @tab) / @sample_size)) - ((select count(*) from @tab) / @sample_size / 2) = 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With