I use Talend to load data into a SQL Server database.

The weakest point of my job appears to be not the data processing but the actual load into the database, which runs no faster than 17 rows/sec.

The odd thing is that I can launch 5 jobs at the same time, and each of them loads at 17 rows/sec.

What could explain this slowness, and how could I improve the speed?

Thanks

New information:

The transfer speed between my desktop and the server is about 1 MByte/s

My job commits every 10,000 rows

I use SQL Server 2008 R2

And the schema I use for my jobs looks like this:

<img src="https://i.stack.imgur.com/QyLtH.jpg" alt="enter image description here">


How to load data faster with Talend and SQL Server

2 Answers

Database INSERT OR UPDATE methods are incredibly costly because the database cannot batch the statements and run them all at once; it has to process them row by row. (ACID transactions force this: if one insert in a batch failed, all of the other records in that commit would fail with it.)
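
To make the per-row cost concrete, here is a minimal T-SQL sketch of what a row-by-row insert-or-update amounts to conceptually. The table `dbo.customers`, its columns and the values are hypothetical, and the exact statements Talend issues are not shown in this answer, so treat this as an illustration rather than the tool's actual behaviour.

```sql
-- Hypothetical target table; names and columns are illustrative only.
CREATE TABLE dbo.customers (
    id   INT          NOT NULL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

-- One incoming row (hypothetical values).
DECLARE @id INT = 3, @name VARCHAR(100) = 'Carol';

-- One existence check plus one write, repeated for every incoming row.
IF EXISTS (SELECT 1 FROM dbo.customers WHERE id = @id)
BEGIN
    UPDATE dbo.customers SET name = @name WHERE id = @id;
END
ELSE
BEGIN
    INSERT INTO dbo.customers (id, name) VALUES (@id, @name);
END
```

Every incoming row pays for the check and the write separately, which is why the cost grows with the number of rows rather than with the number of batches.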

Instead, for large bulk operations it is best to determine whether each record should be inserted or updated before handing anything to the database, and then to send the inserts and the updates as two separate batches.

A typical job that needs this functionality assembles the data to be inserted or updated and then queries the database table for the existing primary keys. If the primary key already exists, the record is sent as an UPDATE; otherwise it is sent as an INSERT. This logic is easy to implement in a tMap component.

<img src="https://i.stack.imgur.com/PKaZY.png" alt="Insert or Update Job Example">

In this job we have some data that we wish to INSERT OR UPDATE into a database table that contains some pre-existing data:

<img src="https://i.stack.imgur.com/tv6pf.png" alt="Initially loaded data">

And we wish to add the following data to it:

<img src="https://i.stack.imgur.com/YnuOL.png" alt="Insert or Update data">

The job first writes the new data to a tHashOutput component so that it can be read several times in the same job (the component simply holds the data in memory and, for large volumes, can cache it to disk).

Next, one copy of the data is read from a tHashInput component straight into a tMap. A second tHashInput component is used to run a parameterised query against the table:

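<img src="https://i.stack.imgur.com/Hi529.png" alt="Parameterised Query">

<img src="https://i.stack.imgur.com/b0A17.png" alt="Parameter Config">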

You may find this guide to Talend and parameterised queries useful. The records returned by the query (that is, only the ones already in the database) are then used as a lookup in the tMap.

The lookup is configured as an INNER JOIN to find the records that need to be updated, and the rejects from the INNER JOIN are the records to be inserted:

<img src="https://i.stack.imgur.com/V2z9O.png" alt="tMap configuration">

These outputs then flow to separate tMySQLOutput components, one to UPDATE and one to INSERT (for SQL Server, the equivalent component is tMSSqlOutput). Finally, when the main subjob is complete, we commit the changes.
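
As a rough sketch of what a job like this ends up sending to the database, here is the same sequence written in T-SQL. It assumes a hypothetical table `dbo.customers` keyed on `id` and incoming keys 1, 2 and 3; none of these names or values come from the job in the screenshots.

```sql
-- Hypothetical target table with two pre-existing rows.
CREATE TABLE dbo.customers (
    id   INT          NOT NULL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);
INSERT INTO dbo.customers (id, name) VALUES (1, 'Alice'), (2, 'Bob');

-- 1. Lookup: fetch only the incoming keys that already exist.
SELECT id FROM dbo.customers WHERE id IN (1, 2, 3);   -- returns 1 and 2

BEGIN TRANSACTION;

-- 2. Rows matched by the inner join go to the UPDATE output.
UPDATE dbo.customers SET name = 'Alice B.' WHERE id = 1;
UPDATE dbo.customers SET name = 'Robert'   WHERE id = 2;

-- 3. Rows rejected by the inner join go to the INSERT output.
INSERT INTO dbo.customers (id, name) VALUES (3, 'Carol');

-- 4. Commit once, after the main subjob has finished.
COMMIT TRANSACTION;
```

The point of the split is that the database only ever receives plain UPDATE and plain INSERT statements, with the existence check done once for the whole batch instead of once per row.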

133

answered Sep 21 '22 12:09

ydaetskcoR

I think that @ydaetskcoR's answer is perfect from a theoretical point of view (it separates the rows that need an INSERT from those that need an UPDATE) and gives you a working ETL solution that is useful for small datasets (a few thousand rows).

Performing the lookup to decide whether a row has to be updated or not is costly in ETL, because all the data travels back and forth between the Talend machine and the DB server.

When you get to hundreds of thousands or even millions of records, you have to move from ETL to ELT: you simply load your data into a temporary (staging) table, as suggested by @Balazs Gunics, and then use SQL to manipulate it.

In this case, after loading your data (INSERT only, which is fast, and even faster with the BULK LOAD components), you issue a LEFT OUTER JOIN between the staging table and the destination table to separate the rows that are already there (and need an update) from the others.
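
A minimal sketch of the staging step in T-SQL, assuming a hypothetical staging table with columns `PK`, `col1` and `col2` and a hypothetical CSV file path. `BULK INSERT` is one way to bulk load in SQL Server; this is not a claim about what the Talend bulk components run internally.

```sql
-- Hypothetical staging table mirroring the destination's columns.
CREATE TABLE dbo.staging (
    PK   INT          NOT NULL,
    col1 VARCHAR(100) NULL,
    col2 VARCHAR(100) NULL
);

-- Empty it before each run so that only the current batch is present.
TRUNCATE TABLE dbo.staging;

-- Bulk load a flat file; the path is resolved on the SQL Server machine.
BULK INSERT dbo.staging
FROM 'C:\load\batch.csv'          -- hypothetical path
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,           -- skip a header line
    TABLOCK                        -- take a table-level lock during the load
);
```

Keeping the staging table free of constraints and extra indexes keeps the load itself cheap; the matching against the destination happens afterwards in SQL.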

This query will give you the rows you need to insert:

SELECT staging.* FROM staging
LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
WHERE destination.PK IS NULL

And this one gives you the rows you need to update:

SELECT staging.* FROM staging
LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
WHERE destination.PK IS NOT NULL
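
The two SELECT statements can be turned directly into the statements that apply the changes. The sketch below assumes hypothetical data columns `col1` and `col2` next to `PK`, and assumes `PK` is unique in `staging`; adjust the column lists to the real tables.

```sql
BEGIN TRANSACTION;

-- Update first, so that the rows inserted below are not touched again.
-- An INNER JOIN selects the same rows as LEFT OUTER JOIN ... IS NOT NULL.
UPDATE d
SET    d.col1 = s.col1,
       d.col2 = s.col2
FROM   destination AS d
INNER JOIN staging AS s ON s.PK = d.PK;

-- Then insert the staging rows that have no match in the destination.
INSERT INTO destination (PK, col1, col2)
SELECT s.PK, s.col1, s.col2
FROM   staging AS s
LEFT OUTER JOIN destination AS d ON d.PK = s.PK
WHERE  d.PK IS NULL;

COMMIT TRANSACTION;
```

Both statements run entirely inside the database as set-based operations, so no rows travel back to the Talend machine.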

This will be orders of magnitude faster than ETL, but you will need to use SQL to operate on your data, whereas in ETL you can use Java because all the data is brought to the Talend server. It is therefore common to have a first step on the local machine that pre-processes the data in Java (to clean and validate it) and then to send it to the DB, where you use the join to load it in the right way.

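Here are the ELT job screenshots:

<img src="https://i.stack.imgur.com/YfnZq.png" alt="INSERT or UPDATE ELT job">

<img src="https://i.stack.imgur.com/hrnSV.png" alt="How to distinguish between rows to insert or update">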

answered Sep 23 '22 12:09

RobMcZag

Tags: sql-server, upsert, database-performance, talend

Krowar
