Generally speaking, if I need to retrieve information from a single table with a large number of records, is it better to do so as multiple queries limited by the primary key and executed in parallel, or is it better to do it as one big select?
E.g. if I have the following table, ERRORS, with at least 5 million rows:
ERROR_NUM  ERROR_CODE  CREATION_DATE  STATE  ...
1          55          09.20.2013     DE
2          23          09.21.2013     CA
3          55          09.21.2013     MI
4          24          09.22.2013     DE
...
5000000    67          11.18.2013     NY
If I need to get information like the errors per state per day, is it better to do a single SELECT what_i_need FROM errors, or should I split it into:
SELECT what_i_need FROM errors WHERE error_num BETWEEN 1 AND 100000;
SELECT what_i_need FROM errors WHERE error_num BETWEEN 100001 AND 200000;
...etc.
Then combine programmatically once the data is retrieved?
Use a single SELECT statement, make sure the table is indexed properly, and use the correct grouping operators at the database level. The indexes and the grouping need to match each other; depending on the actual query, the DBMS may be able to take huge advantage of an index.

As always: measure, measure, measure. Use a realistic set of test data (not only of the same size, but also of the same value distribution!) and have the system write SQL traces while performing the queries. Then analyze the results to see whether the query can be optimized.

Trying to split the query "horizontally" will generally not speed things up, and it might make things a lot worse, because the parallel jobs may block each other and create a huge I/O hog.
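As a rough sketch of what that single-query approach could look like for "errors per state per day" (column names taken from the table above; the index name is hypothetical, and exact syntax varies by DBMS):

-- A composite index whose columns match the grouping columns lets the
-- DBMS satisfy the aggregate largely from the index itself.
CREATE INDEX idx_errors_state_date ON errors (state, creation_date);

-- One pass over the data, aggregated where the data lives:
SELECT state,
       creation_date,
       COUNT(*) AS error_count
FROM errors
GROUP BY state, creation_date;

Most systems expose an execution plan (e.g. EXPLAIN or EXPLAIN PLAN) that shows whether the index is actually being used, which is a good complement to the SQL traces mentioned above.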
I would go with one single select for that. It's generally not the number of rows that slows things down, but the columns and their content.
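To illustrate that point with the same hypothetical table: keeping the column list narrow means far less data has to travel from the storage engine to the client, regardless of the row count.

-- Only the columns the calculation needs:
SELECT state, creation_date FROM errors;

-- Dragging every column (including any wide ones) along instead:
SELECT * FROM errors;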