Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Random sampling from a large dataset

There's a large database from which I have extracted a study population. For comparison purposes, I would like to select a control group that has similar characteristics. The two criteria on what I would like to match are age and gender. The query to give me the numbers that I want for matching purposes is

select sex, age/10 as decades,COUNT(*) as counts
        from
        (
            select distinct m.patid
                ,m.sex,DATEPART(year,min(c.admitdate)) -m.yrdob as Age
                from members as m
                inner join claims as c on c.patid=m.PATID
                group by m.PATID, m.sex,m.yrdob
        )x group by sex, Age/10

with a result set that looks like

enter image description here

The decades column in this age is given by the expression

(DATEPART(year,min(c.admitdate)) -m.yrdob)/10 

and this is used to find people in the age ranges of 20-29, 30-39 etc using integer division. From a larger dataset I would like to select, for instance, 507 females who are in their 20s. The query to find the characteristics of the larger dataset is

select distinct m.patid
        ,m.sex
        ,(DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
        from members as m
        inner join claims as c on c.patid=m.PATID
        group by m.PATID, m.sex,m.yrdob

EDIT: results from second query enter image description here

So I need the sum of the decades column in the second query to equal counts in the first query. What I tried (and returned zero results) is below. What do I need to do to match these ages?

query that runs, but returns no results:

select x.PATID--,x.sex,x.decades,y.counts
    from
    (

    select distinct m.patid
        ,m.sex
        ,(DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
        from members as m
        inner join claims as c on c.patid=m.PATID
        group by m.PATID, m.sex,m.yrdob
    ) as x 
    inner join 
    (

        select sex, age/10 as decades,COUNT(*) as counts
        from
        (
            select distinct m.patid
                ,m.sex,DATEPART(year,min(c.admitdate)) -m.yrdob as Age
                from members as m
                inner join claims as c on c.patid=m.PATID
                group by m.PATID, m.sex,m.yrdob
        )x group by sex, Age/10
    ) as y on x.sex=y.sex and x.decades=y.decades
    group by y.counts,x.PATID,x.sex,y.sex
    having SUM(x.decades)=y.counts and x.sex=y.sex
like image 456
wootscootinboogie Avatar asked Nov 13 '22 16:11

wootscootinboogie


1 Answers

select
   T1.sex,
   T1.decades,
   T1.counts,
   T2.patid

from (

   select 
      sex, 
      age/10 as decades,
      COUNT(*) as counts
   from (

      select  m.patid,
         m.sex,
         DATEPART(year,min(c.admitdate)) -m.yrdob as Age
      from members as m
      inner join claims as c on c.patid=m.PATID
      group by m.PATID, m.sex,m.yrdob
   )x 
   group by sex, Age/10
) as T1
join (
   --right here is where the random sampling occurs
    SELECT TOP 50--this is the total number of peolpe in our dataset
      patid
      ,sex
      ,decades

   from (
      select  m.patid,
         m.sex,
         (DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
      from members as m
      inner join claims as c on c.patid=m.PATID
      group by m.PATID, m.sex, m.yrdob

   ) T2
      order by NEWID()
) as T2
on T2.sex = T1.sex
and T2.decades = T1.decades 

EDIT: I had posted another question similar to this in which I found that my results weren't in fact random, but they were only the TOP N results. I had ordered by newid() in the outermost query and all that was doing was shuffling around the exact same result set. From a question that is now closed, I found out that I needed to use the TOP keyword along with order by newid() in the commented line in the above query.

like image 81
wootscootinboogie Avatar answered Nov 15 '22 07:11

wootscootinboogie