ETL Processing Design and Performance

Tags: c#, sql, database, etl

I am working on an ETL process for a data warehouse using C# that supports both SQL Server and Oracle. During development I have been writing stored procedures that synchronize data from one database to another. The stored procedure code is rather ugly because it involves dynamic SQL: it has to build the SQL strings, since the database name is dynamic.

My team lead wants to use C# code to do the ETL. We have code generation that automatically generates new classes when the database definition changes. That's also why I decided not to use Rhino ETL.

Here are the pros and cons:

Stored Procedure:

Pros:

  • fast loading process, everything is handled by the database
  • easy deployment, no compiling is needed

Cons:

  • poor readability due to dynamic SQL
  • need to maintain both T-SQL and PL/SQL scripts when the database definition changes
  • slow development because there is no IntelliSense when writing dynamic SQL

C# Code:

Pros:

  • easier to develop the ETL process because we get IntelliSense from our generated classes
  • easier to maintain because of the generated classes
  • better logging and error handling

Cons:

  • slow performance compared with stored procedures

I would prefer to use application code for the ETL process, but its performance was horrible compared with stored procedures. In one test that updated 10,000 rows, the stored procedures took only 1 s, while my ETL code took 70 s. Even if I somehow manage to reduce the overhead, 20% of the 70 s is spent purely on calling the update statements from application code.

Could someone provide suggestions or comments on how to speed up the ETL process using application code?

My next idea is to try a parallel ETL process by opening multiple database connections and performing the updates and inserts over them.

Thanks

asked Dec 02 '10 07:12 by dsum


2 Answers

You say you have code generation that automatically generates new classes. Why don't you have code generation that automatically generates new stored procedures?

That should give you the best of both worlds: encapsulate the generation in a few classes that inspect the database and update the stored procedures as necessary. This does not improve the readability of the dynamic SQL, but it hides it, since you would no longer need to update the SPs manually.
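As a rough illustration of generating the per-dialect SQL instead of maintaining it by hand, here is a minimal sketch. It is written in Python only for brevity; the same idea applies to a C# generator. `TableDef`, `qualified_source`, `render_update`, and the table and column names are all hypothetical, and it assumes the dynamic "database name" maps to `db.dbo.table` on SQL Server and to `schema.table` on Oracle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableDef:
    """Hypothetical table description, as a schema inspector might produce it."""
    name: str
    key: str
    columns: tuple  # non-key columns to synchronize


def qualified_source(table: str, source: str, dialect: str) -> str:
    """Qualify the source table name (assumed naming scheme per dialect)."""
    if dialect == "sqlserver":
        return f"{source}.dbo.{table}"
    if dialect == "oracle":
        return f"{source}.{table}"
    raise ValueError(f"unsupported dialect: {dialect}")


def render_update(table: TableDef, source: str, dialect: str) -> str:
    """Render a correlated UPDATE that copies columns from the source table."""
    src = qualified_source(table.name, source, dialect)
    match = f"s.{table.key} = {table.name}.{table.key}"
    assignments = ",\n       ".join(
        f"{col} = (SELECT s.{col} FROM {src} s WHERE {match})"
        for col in table.columns
    )
    return (
        f"UPDATE {table.name}\n"
        f"   SET {assignments}\n"
        f" WHERE EXISTS (SELECT 1 FROM {src} s WHERE {match})"
    )


if __name__ == "__main__":
    orders = TableDef("orders", "id", ("status", "total"))
    print(render_update(orders, "staging", "sqlserver"))
    print(render_update(orders, "staging", "oracle"))
```

A generator like this would be rerun whenever the schema changes, so the emitted T-SQL and PL/SQL stay in step with the generated classes.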

Also, the difference should not be that large. It sounds as if something is not being done right: are you reusing connections, moving data unnecessarily from the server to the application, or processing data in small batches (row by row)?
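To make those points concrete, here is a minimal sketch that uses Python with an in-memory SQLite database as a stand-in for the real server. The table names and helper functions are hypothetical. It shows two alternatives to committing one row at a time: sending all updates over one connection in a single transaction, and loading the changes into a staging table so the database applies them in one set-based statement.

```python
import sqlite3


def make_db(n: int) -> sqlite3.Connection:
    """Create an in-memory table with n rows (stand-in for the warehouse table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO target VALUES (?, 0)", ((i,) for i in range(n)))
    conn.commit()
    return conn


def update_batched(conn: sqlite3.Connection, rows) -> None:
    """Apply (id, amount) pairs over one connection inside one transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "UPDATE target SET amount = ? WHERE id = ?",
            ((amount, row_id) for row_id, amount in rows),
        )


def update_set_based(conn: sqlite3.Connection, rows) -> None:
    """Load (id, amount) pairs into a staging table, then update in one statement."""
    with conn:
        conn.execute(
            "CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, amount INTEGER)"
        )
        conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
        conn.execute(
            "UPDATE target SET amount = "
            "(SELECT s.amount FROM staging s WHERE s.id = target.id) "
            "WHERE id IN (SELECT id FROM staging)"
        )
        conn.execute("DROP TABLE staging")


if __name__ == "__main__":
    db = make_db(10_000)
    update_batched(db, [(i, i * 2) for i in range(10_000)])
    update_set_based(db, [(i, 1) for i in range(5_000)])
    print(db.execute("SELECT COUNT(*) FROM target WHERE amount = 1").fetchone()[0])
```

In C# the analogous levers would be a reused connection, an explicit transaction, and a bulk load into a staging table followed by a single UPDATE or MERGE; how much each one helps is something to measure rather than assume.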

Also, regarding better logging: care to elaborate on that? You can have logging on the database layer too, or you can design your SPs so that the application layer can still do the logging.

answered Sep 28 '22 06:09 by Unreason


If your C# code is already slow with 10,000 rows, I cannot imagine it in a real environment...

Most ETL is done within the database (stored procedures, packages, or even code compiled within the database, such as PL/SQL or Java for Oracle). That approach can handle millions of rows.

Alternatively, professional tools (Informatica or others) can be used. They will still be slower than stored procedures, but easier to manage.

So my conclusion is: if you want to come anywhere close to stored procedure performance, you will have to write an application as good as the professional ones on the market, which took years to develop and mature... Do you think you can?

Plus, if you have to handle different database types (SQL Server, Oracle), you CANNOT make a generic application AND optimize it at the same time. It's a choice, because Oracle does not work the same way SQL Server does.

To give you an idea, ETLs for Oracle use hints (such as the parallel execution hints), and some indexes may be dropped or integrity constraints temporarily disabled to optimize the load.

There is no way that I know of to do the exact same thing in SQL Server (it might have similar options, but with different syntax). So "one ETL for all databases" can hardly be done without losing efficiency and speed.
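To illustrate the "similar options, different syntax" point, the sketch below (Python used only to keep it short) emits the statements that would typically wrap a bulk load on each database. The helper names and the table, index, and constraint names are hypothetical. Whether disabling a given index or constraint is safe or beneficial depends on its type and on the database version, so treat this as an assumption-laden outline, not a tuning recipe.

```python
def pre_load_statements(dialect: str, table: str, index: str, constraint: str) -> list:
    """Statements to run before a bulk load (hypothetical helper)."""
    if dialect == "oracle":
        return [
            f"ALTER TABLE {table} DISABLE CONSTRAINT {constraint}",
            f"ALTER INDEX {index} UNUSABLE",
        ]
    if dialect == "sqlserver":
        return [
            f"ALTER TABLE {table} NOCHECK CONSTRAINT {constraint}",
            f"ALTER INDEX {index} ON {table} DISABLE",
        ]
    raise ValueError(f"unsupported dialect: {dialect}")


def post_load_statements(dialect: str, table: str, index: str, constraint: str) -> list:
    """Statements to run after the load to restore the index and constraint."""
    if dialect == "oracle":
        return [
            f"ALTER INDEX {index} REBUILD",
            f"ALTER TABLE {table} ENABLE CONSTRAINT {constraint}",
        ]
    if dialect == "sqlserver":
        return [
            f"ALTER INDEX {index} ON {table} REBUILD",
            # WITH CHECK revalidates existing rows when re-enabling the constraint.
            f"ALTER TABLE {table} WITH CHECK CHECK CONSTRAINT {constraint}",
        ]
    raise ValueError(f"unsupported dialect: {dialect}")


if __name__ == "__main__":
    for dialect in ("oracle", "sqlserver"):
        args = (dialect, "orders", "ix_orders_date", "fk_orders_customer")
        for stmt in pre_load_statements(*args) + post_load_statements(*args):
            print(f"{dialect}: {stmt}")
```

The shape of the workflow is the same on both sides, but every statement differs, which is why a generic layer ends up carrying a per-dialect branch for each optimization.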

So I think your pros and cons are very accurate: you have to choose between speed and ease of development, and you cannot have both.

answered Sep 28 '22 06:09 by guigui42