
Sampling from Oracle, Need exact number of results (Sample Clause)

I am trying to pull a random sample of a population from a PeopleSoft database. Searching online has led me to think that the SAMPLE clause of the SELECT statement may be a viable option for us, but I am having trouble understanding how the SAMPLE clause determines the number of rows returned. I have looked at the Oracle documentation found here: http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#i2065953

But the above reference only talks about the syntax used to create the sample. The reason for my question is that I need to understand how the sample percent determines the sample size returned. It seems like it applies a random number to the percent you ask for and then uses a seed number to count every "n" records. Our requirement is that we pull an exact number of samples, that they are randomly selected, and that they are representative of the entire table (or at least of the grouping of data we choose with filters).

In a population of 10200 items, if I need a sample of approximately 100 items, I could use this statement:

SELECT * FROM PS_LEDGER SAMPLE(1) --1 % of my total population
WHERE DEPTID = '700064' 

However, we need to pull an exact number of samples (in this case 100), so I could pick a sample percent that almost always returns more than the number I need and then trim it down, i.e.:

SELECT Count(*) FROM PS_LEDGER SAMPLE(2.5) --this percent must always give > 100 items
WHERE DEPTID = '700064' and rownum < 101

My concern with doing that is that my sample would not uniformly represent the entire population. For example, if the sample function just pulls every Nth record after it creates its own randomly generated seed, then choosing rownum < 101 will cut off all of the records chosen from the bottom of the table. What I am looking for is a way to pull exactly 100 records from the table that are randomly selected and fairly representative of the entire table. Please help!

user2284134 asked Apr 15 '13 21:04



2 Answers

Borrowing jonearles' example table, I see exactly the same thing (in 11gR2 on an OEL developer image): the values of a are usually heavily skewed towards 1, and with small sample sizes I sometimes see no 2s at all. With the extra randomisation/restriction step I mentioned in a comment:

select a, count(*) from (
    select * from test1 sample (1)
    order by dbms_random.value
)
where rownum < 101
group by a;

... with three runs I got:

         A   COUNT(*)
---------- ----------
         1         71
         2         29

         A   COUNT(*)
---------- ----------
         1        100

         A   COUNT(*)
---------- ----------
         1         64
         2         36

Yes, all 100 rows really came back with a = 1 on the second run. The skewing itself seems to be rather random. I tried the BLOCK modifier, which seemed to make little difference, perhaps surprisingly; I would have expected it to get worse in this situation.

This is likely to be slower, certainly for small sample sizes, as it has to hit the entire table, but it does give me pretty even splits fairly consistently:

select a, count(*) from (
    select a, b from (
        select a, b, row_number() over (order by dbms_random.value) as rn
        from test1
    )
    where rn < 101
)
group by a;

With three runs I got:

         A   COUNT(*)
---------- ----------
         1         48
         2         52

         A   COUNT(*)
---------- ----------
         1         57
         2         43

         A   COUNT(*)
---------- ----------
         1         49
         2         51

... which looks a bit healthier. YMMV of course.
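
For the table in the question, the same pattern might look like the sketch below. It assumes the PS_LEDGER table and the DEPTID filter from the question; the alias l and the column name rn are only illustrative. Because the filter is applied before the random ordering, the 100 rows are drawn from the filtered group alone.

```sql
-- Sketch: exactly 100 randomly chosen rows from the filtered population
-- (fewer only if the filter itself matches fewer than 100 rows).
select *
from (
    select l.*,
           -- number the filtered rows in a random order
           row_number() over (order by dbms_random.value) as rn
    from ps_ledger l
    where l.deptid = '700064'
)
where rn <= 100;
```

The extra rn column appears in the output; list the wanted columns in the outer select instead of * to leave it out.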


This Oracle article covers some sampling techniques, and you might want to evaluate the ora_hash approach as well, and the stratified version if your data spread and your requirements for 'representativeness' demand it.
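
A stratified variant could be sketched as follows, here against the test1 table and taking 50 rows for each value of a. The per-group size of 50 is an arbitrary choice for illustration, and this is not necessarily the formulation the article uses.

```sql
-- Sketch: random sample of up to 50 rows from each stratum (each value of a).
select a, b
from (
    select a, b,
           -- restart the random numbering within every value of a
           row_number() over (partition by a order by dbms_random.value) as rn
    from test1
)
where rn <= 50;
```

Fixing the count per group guarantees every group is represented, at the cost of no longer sampling each row with equal probability when the groups differ in size.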

Alex Poole answered Nov 18 '22 10:11



You can't trust SAMPLE to return a truly random set of rows from a table. The algorithm appears to be based on the physical properties of the table.

create table test1(a number, b char(2000));

--Insert 10K fat records.  A is always 1.
insert into test1 select 1, level from dual connect by level <= 10000;

--Insert 10K skinny records.  A is always 2.
insert into test1 select 2, null from dual connect by level <= 10000;

--Select about 10 rows.
select * from test1 sample (0.1) order by a;

Run the last query multiple times and you will almost never see any 2s. This may be an accurate sample if you measure by bytes, but not by rows.
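
One way to look at the physical layout behind this is to count how many blocks each group of rows occupies. The query below is only a diagnostic sketch using the DBMS_ROWID package; the aliases row_count and block_count are illustrative, and it does not show how SAMPLE is implemented internally.

```sql
-- Sketch: compare row counts with the number of distinct blocks per value of a.
select a,
       count(*) as row_count,
       -- a block is identified by its relative file number plus block number
       count(distinct dbms_rowid.rowid_relative_fno(rowid) || '.' ||
                      dbms_rowid.rowid_block_number(rowid)) as block_count
from test1
group by a;
```

The fat char(2000) rows should be spread over far more blocks than the same number of skinny rows, which fits the idea that the sample tracks storage more closely than row counts.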

This is an extreme example of skewed data, but I think it's enough to show that SAMPLE doesn't work the way the manual implies it should. As others have suggested, you'll probably want to ORDER BY DBMS_RANDOM.VALUE.
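
Against the test table above, that suggestion might be sketched like this. The row limit of 10 simply mirrors the "about 10 rows" of the SAMPLE query and is otherwise arbitrary.

```sql
-- Sketch: exactly 10 rows chosen at random, independent of physical row size.
select *
from (
    -- the whole table is read and sorted in a random order
    select * from test1
    order by dbms_random.value
)
where rownum <= 10;
```

The ordering has to happen in the inline view; putting rownum and ORDER BY in the same query block would restrict the rows before they are shuffled.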

Jon Heller answered Nov 18 '22 09:11
