Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Data Sampling

Tags:

sql

We have had a request to provide some data to an external company. They require only a sample of data, simple right? wrong.

Here is their sampling criteria:

  • Total Number of records divided by 720 (required sample size) - this gives sampling interval (if result is a fraction, round down to next whole number).

  • Halve the sampling interval to get the starting point.

  • Return each record by adding on the sampling interval.

EXAMPLE:

  • 10,000 Records - Sampling interval = 13 (10,000/720)
  • Starting Point = 6 (13/2 Rounded)
  • Return records 6, 19 (6+13), 32 (19+13), 45 (32+13) etc.....

Please can someone tell me how (if) something like this is possible in SQL.

like image 385
Richard.Gale Avatar asked Feb 23 '12 12:02

Richard.Gale


2 Answers

If you have use of ROW_NUMBER(), then you can do this relatively easily.

SELECT
  *
FROM
(
  SELECT
    ROW_NUMBER() OVER (ORDER BY a, b, c, d) AS record_id,
    *
  FROM
    yourTable
)
  AS data
WHERE
  (record_id + 360) % 720 = 0

ROW_NUMBER() gives all your data a sequential identifier (this is important as the id field must both be unique and NOT have ANY gaps). It also defines the order you want the data in (ORDER BY a, b, c, d).

With that id, if you use Modulo (Often the % operator), you can test if the record is the 720th record, 1440th record, etc (because 720 % 720 = 0).

Then, if you offset your id value by 360, you can change the starting point of your result set.

EDIT

After re-reading the question, I see you don't want every 720th record, but uniformly selected 720 records.

As such, replace 720 with (SELECT COUNT(*) / 720 FROM yourTable)

And replace 360 with (SELECT (COUNT(*) / 720) / 2 FROM yourTable)

EDIT

Ignoring the rounding conditions will allow a result of exactly 720 records. This requires using non-integer values, and the result of the modulo being less than 1.

WHERE
  (record_id + (SELECT COUNT(*) FROM yourTable) / 1440.0)
  %
  ((SELECT COUNT(*) FROM yourTable) / 720.0)
  <
  1.0
like image 105
MatBailie Avatar answered Sep 28 '22 00:09

MatBailie


declare @sample_size int, @starting_point int

select @sample_size = 200

select top (@sample_size) col1, col2, col3, col4 
from (
    select *, row_number() over (order by col1, col2) as row
    from your_table
) t
where  (row   % ((select count(*) from your_table) / @sample_size)) - (select count(*) from your_table) / @sample_size / 2) = 0

It's going to work in SQL Server 2005+.

TOP (@variable) is used to limit rows (where condition because of integers rounding might not be enough, may return more rows then needed) and ROW_NUMBER() to number and order rows.

Working example: https://data.stackexchange.com/stackoverflow/query/62315/sql-data-sampling below code:

declare @tab table (id int identity(1,1), col1 varchar(3), col2 varchar(3))

declare @i int

set @i = 0

while @i <= 1000
begin
  insert into @tab
  select 'aaa', 'bbb'
  set @i = @i+1
end

declare @sample_size int

select @sample_size = 123

select ((select count(*) from @tab) / @sample_size) as sample_interval

select top (@sample_size) *
from (
    select *, row_number() over (order by col1, col2, id desc) as row
    from @tab
) t
where  (row   % ((select count(*) from @tab) / @sample_size)) - ((select count(*) from @tab) / @sample_size / 2) = 0
like image 26
Michał Powaga Avatar answered Sep 28 '22 00:09

Michał Powaga