
Join conditions on multiple columns versus single join on concatenated columns?

I've noticed there are two ways to join multiple tables to get the result set I need. A column in that result set will later be updated, so speed may be a requirement. The result set can be obtained either by:

Case 1:

select ert.* 
from eval_rep_track ert
inner join 
(
        select erp.evaluation_fk, erp.report_type, LTRIM(erp.assign_group_id, '/site/') course_name
        from eval_report_dup@prod erp
        inner join eval_report er
        on er.id = erp.id
        where erp.status='queue'
        and er.status='done'
) cat
on ert.eval_id || '.' || ert.report_type || '.' || ert.course_name
 = cat.evaluation_fk || '.' || cat.report_type || '.' || cat.course_name;

OR

Case 2:

select ert.* 
from eval_rep_track ert
inner join 
(
        select erp.evaluation_fk, erp.report_type, LTRIM(erp.assign_group_id, '/site/') course_name
        from eval_report_dup@prod erp
        inner join eval_report er
        on er.id = erp.id
        where erp.status='queue'
        and er.status='done'
) cat
on ert.eval_id = cat.evaluation_fk  
and ert.report_type = cat.report_type  
and ert.course_name = cat.course_name;

Both give the same result; only the join condition varies. Which will execute faster?

eval_id is a NUMBER; report_type and course_name are VARCHAR2.

In the SQL client I'm using, case 1 reports the following stats: [SELECT - 3077 row(s), 0.048 secs] Result set fetched ... 1 statement(s) executed, 3077 row(s) affected, exec/fetch time: 0.048/0.236 sec [1 successful, 0 warnings, 0 errors]

while case 2 reports: [SELECT - 3077 row(s), 0.019 secs] Result set fetched ... 1 statement(s) executed, 3077 row(s) affected, exec/fetch time: 0.019/0.194 sec [1 successful, 0 warnings, 0 errors]

The results suggest case 2 is faster. Will this hold universally across any client (IDE) and database? Does it depend on the data types, or is concatenation always expensive? I don't actually need the result of the concatenation. Thanks.

asked Dec 15 '22 by paxmemento


2 Answers

I think the version with concatenation will practically always be slower.

If any of the columns you're comparing individually are indexed, the database can normally use those indexes to optimize the join. When you compare concatenations, it has to perform full table scans, because the result of a computed expression won't be in any index.
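For example, an ordinary composite index on the three join columns (a hypothetical index name; the table and columns are from the question) gives the optimizer something to use for case 2, while case 1's concatenated expression appears in no index:

-- Plain composite index covering the multi-column join in case 2.
-- Case 1 compares ert.eval_id || '.' || ..., a computed value the
-- optimizer cannot find in this (or any) index.
create index eval_rep_track_join_ix
    on eval_rep_track (eval_id, report_type, course_name);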

And even if the columns are not indexed, the database can still perform the comparisons more efficiently. It compares one pair of columns at a time, and can stop as soon as one of those comparisons fails. When using the concatenation, it has to first combine all the columns, in both rows, then do a string comparison.
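You can check what your own database actually does with each variant. In Oracle, for instance, EXPLAIN PLAN plus DBMS_XPLAN shows the chosen plan and any indexes used. A sketch, using the question's own query; run it once per variant, substituting the join condition from case 1 or case 2:

explain plan for
select ert.*
from eval_rep_track ert
inner join
(
        select erp.evaluation_fk, erp.report_type, LTRIM(erp.assign_group_id, '/site/') course_name
        from eval_report_dup@prod erp
        inner join eval_report er
        on er.id = erp.id
        where erp.status='queue'
        and er.status='done'
) cat
on ert.eval_id = cat.evaluation_fk
and ert.report_type = cat.report_type
and ert.course_name = cat.course_name;

-- Display the plan, including which (if any) indexes were used.
select * from table(dbms_xplan.display);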

Finally, if any of the columns are numeric, the concatenation will require the additional step of converting the number to a string.
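Here eval_id (and presumably evaluation_fk) is a NUMBER, so case 1's join condition is effectively the following, with the implicit conversions written out:

on TO_CHAR(ert.eval_id) || '.' || ert.report_type || '.' || ert.course_name
 = TO_CHAR(cat.evaluation_fk) || '.' || cat.report_type || '.' || cat.course_name;

That's an extra per-row TO_CHAR on each side before the string comparison even starts.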

answered Dec 31 '22 by Barmar


Quite simply, joining on the individual columns is correct. Joining on concatenated values is incorrect. Separate from any discussion of performance, you should write correct code.

For any particular query, you can probably write a mostly correct query using concatenation. But you'll almost certainly introduce subtle bugs that bite you when you get data you don't expect. In this case, as soon as one of your columns contains a period, you can match data incorrectly ('a.b' || '.' || null = 'a' || '.' || 'b.').

In other cases you'll have other subtle issues: dates and numbers may be converted to strings implicitly using session-level settings that differ between users and thus produce different results (your NLS_DATE_FORMAT might or might not contain the time component, so your concatenated value may or may not include a comparison of the time). If you concatenate columns routinely, you're going to end up with lots of queries that have very subtle bugs depending on the data in the table and on the user executing the code. That's terrible from a maintenance and support perspective. Performance should be, at most, a secondary concern.
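A quick sketch of that false match (hypothetical values; Oracle treats a concatenated NULL as an empty string):

-- Two different column pairs that concatenate to the same key, 'a.b.'.
-- A join on the individual columns would never match them: the NULL
-- alone would fail the equality test.
select 'false match' as result
from dual
where 'a.b' || '.' || null   -- pair ('a.b', NULL)
    = 'a'  || '.' || 'b.';   -- pair ('a', 'b.')

This returns a row, even though neither column pair actually matches the other.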

From a performance perspective, a proper join will almost certainly outperform the concatenation approach. When you join correctly, the optimizer can consider normal indexes on the various columns that are part of the join while generating the query plan. If you're concatenating values, at best Oracle might be able to do a full scan of a normal index to get all the data that needs to be concatenated together, but that is likely to be much less efficient (particularly once you have more than a few thousand rows).

Is it theoretically possible that the concatenation approach would be more efficient for some query somewhere? Sure. A sadistic developer might create a function-based index on the concatenated result, avoid creating an index on the individual columns, and produce a test case where the concatenation approach is more efficient. That would be easily remedied, however, by creating the appropriate corresponding index (or indexes) on the basic columns. Is it possible that concatenation would be more efficient for some query because it prevents the optimizer from using an index that it otherwise wants to use? Sure. But that almost certainly indicates that you have a problem with optimizer settings or statistics that should be addressed rather than throwing a band-aid on the problem.
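A sketch of that contrived setup (hypothetical index name; the numeric column is given an explicit format mask on the assumption that an expression whose result depends on session NLS settings can't be indexed):

-- The function-based index a "sadistic developer" might create so that
-- the concatenated join can do an index-driven lookup:
create index ert_concat_fbi on eval_rep_track
    (TO_CHAR(eval_id, 'FM9999999999') || '.' || report_type || '.' || course_name);

-- Easily remedied with an ordinary composite index on
-- (eval_id, report_type, course_name), which serves the correct
-- multi-column join at least as well.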

answered Dec 31 '22 by Justin Cave