Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Need a better option - outer joining 32 times to same table

I have a nasty SQL query problem and I'd love help with an elegant solution. I'm trying to avoid 32 left outer joins to the same table.

The database is Teradata.

I have a table with 14 million records and 33 columns. The primary key (let's call it Trans_Id), and 32 encoded fields (let's call them encoded_1 ... encoded_32). Something like this:

CREATE SET TABLE BigTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      TRANS_ID VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_1 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_2 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_3 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ...
      ENCODED_32 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC )
PRIMARY INDEX ( TRANS_ID );

I also have a single table with the coded / decoded values. Let's say there are 100 records in this table.

CREATE SET TABLE LookupTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      UNIQ_PK { just random numbers }
      ENCODED_VAR VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      DECODED_DESC VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( UNIQ_PK );

I want to avoid a nasty join like this (I used ellipses instead of showing all 32 outer joins):

SELECT 
TRANS_ID
, a.ENCODED_1
, b1.DECODED_DESC DECODED_DESC_1
, a.ENCODED_2
, b2.DECODED_DESC DECODED_DESC_2
...
, a.ENCODED_31
, b31.DECODED_DESC DECODED_DESC_31
, a.ENCODED_32
, b32.DECODED_DESC DECODED_DESC_32
FROM BigTable a
LEFT OUTER JOIN LookupTable b1 ON a.ENCODED_1 = b1.ENCODED
LEFT OUTER JOIN LookupTable b2 ON a.ENCODED_2 = b1.ENCODED
...
LEFT OUTER JOIN LookupTable b31 ON a.ENCODED_31 = b31.ENCODED
LEFT OUTER JOIN LookupTable b32 ON a.ENCODED_32 = b32.ENCODED

Any help would be appreciated. I have a feeling outer joining 14M records 32 times is not the efficient way to do this!

like image 846
LukeK Avatar asked May 12 '11 05:05

LukeK


People also ask

How do I avoid too many joins in SQL?

Using the Entity Framework Profiler, I've received the suggestion to reduce the number of joins and, instead, perform several separate queries: link. Each join requires the database to perform additional work, and the complexity and cost of the query grows rapidly with each additional join.

How do I join the same table twice?

To form a self-join, you specify the same table twice with different table aliases and provide the join predicate after the ON keyword. In this syntax, the table_name is joined to itself using the INNER JOIN clause.

How improve left outer join in SQL Server?

Push predicates into the OUTER JOIN clause whenever possible For SQL queries with the LEFT OUTER JOIN, pushing predicates of the right table from the WHERE clause into the ON condition helps the database optimizer generate a more efficient query. Predicates of the left table can stay in the WHERE clause.

How do I use the same table twice in SQL?

You use a single table twice in a query by giving it two names, like that. The aliases are often introduced with the keyword AS. You also normally specify a join condition (for without it, you get the Cartesian Product of the table joined with itself). For preference you use the explicit JOIN notation.


2 Answers

You could create a function that takes as a parameter the VARCHAR(10) encoded_var and return the VARCHAR(50) decoded_desc, then your select would be something like this:

SELECT TRANS_ID,
     ENCODED_1, somefunc(ENCODED_1) AS DECODED_DESC_1,
     ENCODED_2, somefunc(ENCODED_2) AS DECODED_DESC_2,
     etc.

Depending on the number of rows you are planning on returning at a time, this would be doable.

like image 50
user749976 Avatar answered Oct 19 '22 14:10

user749976


If encoded_1, encoded_2, etc are all being used as look up keys to the same table, it sounds like they are all the "same idea". But my first thought is that a better design in this case would be:

big_table (trans_id, var_id, encoded_var)
lookup_table (encoded_var, decoded_desc)

Then the query just becomes:

select trans_id, var_id, encoded_var, decoded_desc
from big_table
join lookup_table on lookup_table.encoded_var=big_table.encoded_var

I don't know if this is the real field name or if you're just trying to leave out irrelevant details. You may be leaving out relevant details here. What's the difference between encoded_1 and encoded_2, etc? If they are interchangeable, there is no reason to have separate fields for them. Indeed, it causes a lot of problems. Even if there is a semantic difference, if they all use the same lookup table, they must all be coming from the same domain.

For example, a few years ago I worked on a system to manage technical manuals that our organization produced and used. Each manual had 3 managers. (An administrative manager who handled budgets and schedules, a stock manager who kept track of who needed copies and made sure they got them, and a content manager responsible for the actual text.) But they were all drawn from the same list of people, as often the same person would have more than one of these roles or could have different roles for different manuals. So we made a table of "people" with an id, name, email address, etc, and then in the basic manual record I created 3 columns, one for each manager type.

This was a huge mistake. What I should have done was create a separate table with manual id, manager type id, and person id, and then have 3 RECORDS for the 3 manager types rather than 3 fields within one record.

Why? With three columns, I ran into the same problem you are describing, though on a smaller scale: I had to join from the manual table to the person table three times. A query like "what books is Bob Smith responsible for?" required a surprising complex query, something like

select ... whatever ...
from manual
join person p1 on p1.person_id=manual.admin_id
join person p2 on p2.person_id=manual.stockmanager_id
join person p3 on p3.person_id=manual.contentmanager_id
where p1.name='Bob Smith'
 or p2.name='Bob Smith'
 or p3.name='Bob Smith'

With a single column it would have been simply

select ... whatever ...
from manual
join manual_manager on manual_manager.manual_id=manual.manual_id
join person on person.person_id=manual_manager.person_id
where person.name='Bob Smith'"

With all the repetition, it was not surprising that there were a couple of times that a programmer accidentally only checked 2 of the fields instead of all 3. With 1 field this error wouldn't be possible. With 3 fields, if we added a 4th type of manager, we would have had to add another column, and then change every query that looked at these fields. With 1 field, we probably wouldn't. Etc.

With 3 fields we needed 3 indexes, and there are other performance implications.

I suspect the same sort of thinking applies to you.

If your 32 fields are all completely interchangeable, then the table would only need a sequence number to make a unique pk. If there is some difference between them, then you could create a code to distinguish them.

like image 26
Jay Avatar answered Oct 19 '22 13:10

Jay