Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

TABLESAMPLE returns wrong number of rows?

I've just discovered the TABLESAMPLE clause but surprisingly it doesn't return the number of rows i've specified.

The table that i've used has ~14M rows and i wanted an arbitrary sample of 10000 rows.

select * from tabData TABLESAMPLE(10000 ROWS)

I get not 10000 but a different number everytime i execute it(between 8000 and 14000).

What's going on here, have i misunderstood the intended purpose of TABLESAMPLE?

Edit:

David's link explains it pretty well.

This returns always 10000 roughly random rows in an efficient way:

select TOP 10000 * from tabData TABLESAMPLE(20000 ROWS);

and the REPEATABLE option helps to get always the same (unless data has changed)

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS) REPEATABLE(100);

Since i wanted to know if it's more expensive to use TABLESAMPLE with a large number of rows to ensure(?) that i get the correct row-number, i've measured it;

1.loop (20 times):

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS);

(9938 row(s) affected)
(10000 row(s) affected)
(9383 row(s) affected)
(9526 row(s) affected)
(10000 row(s) affected)
(9545 row(s) affected)
(9560 row(s) affected)
(9673 row(s) affected)
(9608 row(s) affected)
(9476 row(s) affected)
(9766 row(s) affected)
(10000 row(s) affected)
(9500 row(s) affected)
(9941 row(s) affected)
(9769 row(s) affected)
(9547 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(9478 row(s) affected)
First batch(only 10000 rows) completed in: 14 seconds!

2.loop (20 times):

select TOP 10000 * from tabData TABLESAMPLE(10000000 ROWS);

(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
Second batch(max rows) completed in: 13 seconds!

3.loop: counterscheck with 100% random rows using ORDER BY NEWID():

select TOP 10000 * from tabData ORDER BY NEWID();

(10000 row(s) affected)

Cancelled after one execution that lasted 23 minutes

Conclusion:

So suprisingly the approach with an exact TOP clause and a large number in TABLESAMPLE is not slower. Hence it's a very efficient alternative to ORDER BY NEWID() if it doesn't matter that the rows are not random per row but per page level(Each 8K page for the table is given a random value).

like image 226
Tim Schmelter Avatar asked May 23 '12 18:05

Tim Schmelter


3 Answers

See the article here. You need to add a top clause and/or use the repeatable option to get the number of rows you want.

like image 58
David Brabant Avatar answered Sep 25 '22 14:09

David Brabant


From the documentation.

The actual number of rows that are returned can vary significantly. If you specify a small number, such as 5, you might not receive results in the sample.

http://msdn.microsoft.com/en-us/library/ms189108(v=sql.90).aspx

like image 40
paparazzo Avatar answered Sep 23 '22 14:09

paparazzo


This behavior has been documented before. There is a good writeup on it here.

I believe you can fix it by passing REPEATABLE with the same seed each time. Here is a snippit from the writeup:

...you will notice that different numbers of rows are returned everytime. Without any data changing, re-running the identical query keeps giving different results. This is non -deterministic factor of TABLESAMEPLE clause. If table is static and rows are not changed what could be the reason to return different numbers of the rows to return in each execution. The factor is 10 PERCENT is not the percentages of the table rows or tables records, it is percentages of the table’s data pages. Once the sample pages of data selected, all the rows from the selected pages are returned, it will not limit the number of rows sampled from that page. Fill factor of all the pages varies depends on the data of the table. This makes script to return different row count in result set everytime it is executed. The REPEATABLE option causes a selected sample to be returned again. When REPEATABLE is specified with the same repeat_seed value, SQL Server returns the same subset of rows, as long as no changes have been made to the table. When REPEATABLE is specified with a different repeat_seed value, SQL Server will typically return a different sample of the rows in the table. .

like image 38
Abe Miessler Avatar answered Sep 22 '22 14:09

Abe Miessler