Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are the differences between Merge Join and Lookup transformations in SSIS?

People also ask

What is the difference between Lookup and join?

Lookup provides the solution with left outer join by default. Lookup provides the efficiency to retrieve respective columns. Whereas, join enables you to perform inner join, left outer join, right outer join and full outer join.

Which join is used in Lookup transformation in SSIS?

The Lookup transformation tries to perform an equi-join between values in Source 2 and Source 1 input. If there is no matching entry in the reference dataset, rows without matching entries are considered as errors by default.

What is Lookup transformation in SSIS?

The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset. You use the lookup to access additional information in a related table that is based on values in common columns.

What is the difference between merge and join in SQL?

Both join and merge can be used to combines two dataframes but the join method combines two dataframes on the basis of their indexes whereas the merge method is more versatile and allows us to specify columns beside the index to join on for both dataframes.


Screenshot #1 shows few points to distinguish between Merge Join transformation and Lookup transformation.

Regarding Lookup:

If you want to find rows matching in source 2 based on source 1 input and if you know there will be only one match for every input row, then I would suggest to use Lookup operation. An example would be you OrderDetails table and you want to find the matching Order Id and Customer Number, then Lookup is a better option.

Regarding Merge Join:

If you want to perform joins like fetching all Addresses (Home, Work, Other) from Address table for a given Customer in the Customer table, then you have to go with Merge Join because the customer can have 1 or more addresses associated with them.

An example to compare:

Here is a scenario to demonstrate the performance differences between Merge Join and Lookup. The data used here is a one to one join, which is the only scenario common between them to compare.

  1. I have three tables named dbo.ItemPriceInfo, dbo.ItemDiscountInfo and dbo.ItemAmount. Create scripts for these tables are provided under SQL scripts section.

  2. Tablesdbo.ItemPriceInfo and dbo.ItemDiscountInfo both have 13,349,729 rows. Both the tables have the ItemNumber as the common column. ItemPriceInfo has Price information and ItemDiscountInfo has discount information. Screenshot #2 shows the row count in each of these tables. Screenshot #3 shows top 6 rows to give an idea about the data present in the tables.

  3. I created two SSIS packages to compare the performance of Merge Join and Lookup transformations. Both the packages have to take the information from tables dbo.ItemPriceInfo and dbo.ItemDiscountInfo, calculate the total amount and save it to the table dbo.ItemAmount.

  4. First package used Merge Join transformation and inside that it used INNER JOIN to combine the data. Screenshots #4 and #5 show the sample package execution and the execution duration. It took 05 minutes 14 seconds 719 milliseconds to execute the Merge Join transformation based package.

  5. Second package used Lookup transformation with Full cache (which is the default setting). creenshots #6 and #7 show the sample package execution and the execution duration. It took 11 minutes 03 seconds 610 milliseconds to execute the Lookup transformation based package. You might encounter the warning message Information: The buffer manager has allocated nnnnn bytes, even though the memory pressure has been detected and repeated attempts to swap buffers have failed. Here is a link that talks about how to calculate lookup cache size. During this package execution, even though the Data flow task completed faster, the Pipeline cleanup took lot of time.

  6. This doesn't mean Lookup transformation is bad. It's just that it has to be used wisely. I use that quite often in my projects but again I don't deal with 10+ million rows for lookup everyday. Usually, my jobs handle between 2 and 3 millions rows and for that the performance is really good. Upto 10 million rows, both performed equally well. Most of the time what I have noticed is that the bottleneck turns out to be the destination component rather than the transformations. You can overcome that by having multiple destinations. Here is an example that shows the implementation of multiple destinations.

  7. Screenshot #8 shows the record count in all the three tables. Screenshot #9 shows top 6 records in each of the tables.

Hope that helps.

SQL Scripts:

CREATE TABLE [dbo].[ItemAmount](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemNumber] [nvarchar](30) NOT NULL,
    [Price] [numeric](18, 2) NOT NULL,
    [Discount] [numeric](18, 2) NOT NULL,
    [CalculatedAmount] [numeric](18, 2) NOT NULL,
CONSTRAINT [PK_ItemAmount] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[ItemDiscountInfo](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemNumber] [nvarchar](30) NOT NULL,
    [Discount] [numeric](18, 2) NOT NULL,
CONSTRAINT [PK_ItemDiscountInfo] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[ItemPriceInfo](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemNumber] [nvarchar](30) NOT NULL,
    [Price] [numeric](18, 2) NOT NULL,
CONSTRAINT [PK_ItemPriceInfo] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

Screenshot #1:

1

Screenshot #2:

2

Screenshot #3:

3

Screenshot #4:

4

Screenshot #5:

5

Screenshot #6:

6

Screenshot #7:

7

Screenshot #8:

8

Screenshot #9:

9


A Merge Join is designed to produce results similar to how JOINs work in SQL. The Lookup component does not work like a SQL JOIN. Here's an example where the results would differ.

If you have a one-to-many relationship between input 1 (e.g., Invoices) and input 2 (e.g., Invoice Line Items), you want the results of the combining of these two inputs to include one or more rows for a single invoice.

With a Merge Join you will get the desired output. With a Lookup, where input 2 is the look up source, the output will be one row per invoice, no matter how many rows exist in input 2. I don't recall which row from input 2 the data would come, but I'm pretty sure you will get a duplicate-data warning, at least.

So, each component has its own role in SSIS.


I will suggest a third alternative to consider. Your OLE DBSource could contain a query rather than a table and you could do the join there. This is not good in all situations but when you can use it then you don't have to sort beforehand.


Lookup is similar to left-join in Merge Join component. Merge can do other types of joins, but if this is what you want, the difference is mostly in performance and convenience.

Their performance characteristics can be very different depending on relative amount of data to lookup (input to lookup component) and amount of referenced data (lookup cache or lookup data source size).

E.g. if you only need to lookup 10 rows, but referenced data set is 10 millions rows - Lookup using partial-cache or no-cache mode will be faster as it will only fetch 10 records, rather than 10 millions. If you need to lookup 10 millions rows, and referenced data set is 10 rows - fully cached Lookup is probably faster (unless those 10 millions rows are already sorted anyway and you can try Merge Join). If both data sets are large (especially if more than available RAM) or the larger one is sorted - Merge might be better choice.


there are 2 differences:

  1. Sorting:

    • a merge join requires both inputs to be sorted the same way
    • lookup does not require either input to be sorted.
  2. Database query load:

    • a merge join does not refer to the database , just the 2 input flows (although the reference data is typically in the form of 'select * from table order by join critera' )
    • lookup will issue 1 query for each (distinct, if cached) value that it is being asked to join on. This rapidly becomes more expensive than the above select.

This leads to: if it is no effort to produce a sorted list, and you want more than about 1% of the rows (single row selects being ~100x the cost of the same row when streaming) (you don't want to sort a 10 million row table in memory ..) then merge join is the way to go.

If you only expect a small number of matches (distinct values looked up, when caching is enabled) then lookup is better.

For me, the tradeoff between the two comes between 10k and 100k rows needing to be looked up.

The one which is quicker will depend on

  • the total number of rows to be processed. (if the table is memory resident, a sort of the data to merge it is cheap)
  • the number of duplicate lookups expected. (high per-row overhead of lookup)
  • if you can select sorted data (note, text sorts are influence by code collation, so be careful that what sql considers sorted is also what ssis considers sorted)
  • what percentage of the entire table you will look up. (merge will require selecting every row, lookup is better if you only have a few rows on one side)
  • the width of a row (rows per page can strongly influences the io cost of doing single lookups vs a scan) (narrow rows -> more preference for merge)
  • the order of data on disk (easy to produce sorted output, prefer merge, if you can organised the lookups to be done in physical disk order, lookups are less costly due to less cache misses)
  • network latency between the ssis server and the destination (larger latency -> prefer merge)
  • how much coding effort you wish to spend (merge is a bit more complex to write)
  • the collation of the input data -- SSIS merge has wierd ideas about sorting of text strings which contain non-alphanumeric characters, but are not nvarchar. (this goes to sorting, and getting sql to emit a sort which ssis is happy to merge is hard)

Merge Join allows you to join to multiple columns based on one or more criterion, whereas a Lookup is more limited in that it only fetches a one or more values based on some matching column information -- the lookup query is going to be run for each value in your data source (though SSIS will cache the data source if it can).

It really depends on what your two data sources contain and how you want your final source to look after the merge. Could you provide any more details about the schemas in your DTS package?

Another thing to consider is performance. If used incorrectly, each could be slower than the other, but again, it's going to depend on the amount of data you have and your data source schemas.