I don't think there's an easy way to do this, but on the off chance that there is ...
I am given a number of lists of around 10,000 records each, drawn from a 10-million-record table. The lists are currently generated by queries on various non-indexed columns. I want to automatically build queries that return the same results using ten separate indexed fields.
Is there a known algorithm for building something like this? Beyond the basics of including each indexed 'node' with its own OR, I mean.
E.g., assuming the data wanted is:
Letter, Number
A, 1
A, 2
B, 1
C, 2
and the original database has
Letter, Number
A, 1
A, 2
A, 3
B, 1
C, 1
C, 2
D, 1
D, 3
I'd like something like:
WHERE ((Letter = 'A' OR Letter = 'B') AND (Number = 1 OR Number = 2))
OR (Letter = 'C' AND Number = 2)
Or maybe
WHERE (Letter IN ('A', 'B', 'C') AND Number IN (1, 2)
AND NOT (Number = 1 AND Letter = 'C'))
But I think I'd rather not have
WHERE (Letter = 'A' AND Number = 1) OR
(Letter = 'A' AND Number = 2) OR
(Letter = 'B' AND Number = 1) OR
(Letter = 'C' AND Number = 2)
-- unless the database experts here think that form would actually be better optimized in the long term for the sample sizes we're talking about. Run time of the queries is important; run time of the conversion tool is not. I also don't necessarily need the 'best' answer; 'good enough' is acceptable.
My current plan is to count, sort, and iterate through the lists looking for values that can be grouped together, trying to produce as few 'groupings' as possible; I'd rather not end up with ten thousand (A and B and C and D and E and F and G and H and I and J) terms ORed together.
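To make that concrete, here is a rough sketch of the grouping step in SQL (purely illustrative: the staging table t_wanted_rows and its columns are names I'm making up here, and LISTAGG is Oracle-flavoured syntax; other databases have equivalents such as STRING_AGG):
-- Hypothetical sketch: t_wanted_rows holds the ~10,000 wanted (letter, number)
-- pairs; "number" is a reserved word, hence number_val.
-- The inner query builds each letter's set of numbers; the outer query collapses
-- letters that share an identical set into a single grouping.
SELECT numbers,
       LISTAGG(letter, ',') WITHIN GROUP (ORDER BY letter) AS letters
FROM (
    SELECT letter,
           LISTAGG(TO_CHAR(number_val), ',') WITHIN GROUP (ORDER BY number_val) AS numbers
    FROM t_wanted_rows
    GROUP BY letter
)
GROUP BY numbers;
-- Each result row would then become one (Letter IN (...) AND Number IN (...)) term.
This only merges letters whose number sets are identical; it doesn't exploit combinations that simply don't exist in the big table, the way the (A OR B) example above does, but it seems like a reasonable starting point.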
Thoughts? Expert advice?
Sorry, this is not really an answer to your question, but rather my own musings on the problem.
I'd suggest storing your lists in a separate filter table. That will let you do a joined select between the two tables in the end. You may or may not want an index on the filter table, depending on performance tests with your data.
The exact implementation will differ depending on the particular RDBMS you intend to use. In my example I'll stick to Oracle, as it's what I know best.
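-- Create a filter table to hold the wanted (letter, number) pairs: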
CREATE TABLE t_filter_lists (
f_letter varchar2(1),
f_number number
);
-- Optionally, create an index:
CREATE INDEX ix_filter_lists
ON t_filter_lists (
f_letter,
f_number
);
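-- Load the wanted pairs into the filter table (example values from the question):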
INSERT INTO t_filter_lists (f_letter, f_number) VALUES ('A', 1);
INSERT INTO t_filter_lists (f_letter, f_number) VALUES ('A', 2);
INSERT INTO t_filter_lists (f_letter, f_number) VALUES ('B', 1);
INSERT INTO t_filter_lists (f_letter, f_number) VALUES ('C', 2);
COMMIT;
-- (Oracle-specific part) gather statistics on the filter table
EXEC DBMS_STATS.GATHER_TABLE_STATS(...
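-- For illustration, a complete call might look like this (schema defaulted to
-- the current user; adjust the names to your own):
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'T_FILTER_LISTS');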
-- Run your query
SELECT *
FROM t_your_table t
INNER JOIN t_filter_lists f
ON f.f_letter = t.t_letter
AND f.f_number = t.t_number;
The benefit of this solution is that, as long as the table and index statistics are complete and fresh, you will not have the headache of choosing the correct order of the predicates based on which columns are indexed, in which order, and what their estimated cardinalities are. The optimizer will do that work for you, and it should be quite good at it.
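If you want to see what the optimizer actually does with your data (and to compare runs with and without the index), you can inspect the execution plan. A minimal sketch, reusing the example names from above:
-- Show the plan Oracle picks for the join against your real table:
EXPLAIN PLAN FOR
SELECT *
FROM t_your_table t
INNER JOIN t_filter_lists f
ON f.f_letter = t.t_letter
AND f.f_number = t.t_number;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);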