ETL Processing Design and Performance

Tags:

c#

sql

database

etl

I am working on an ETL process for a data warehouse, written in C#, that supports both SQL Server and Oracle. During development I have been writing stored procedures that synchronize data from one database to another. The stored procedure code is rather ugly because it relies on dynamic SQL: the database names are not fixed, so the SQL strings have to be built at run time.

My team lead wants to do the ETL in C# code. We have code generation that automatically generates new classes when the database definition changes. That's also why I decided not to use Rhino ETL.

Here are the pros and cons:

Stored Procedure:

Pros:

fast loading process, everything is handled by the database
easy deployment, no compiling is needed

Cons:

poor readability due to dynamic SQL
need to maintain both T-SQL and PL/SQL scripts when the database definition changes
slow development because there is no IntelliSense when writing dynamic SQL

C# Code:

Pros:

easier to develop the ETL process because we get IntelliSense from our generated classes
easier to maintain because of the generated classes
better logging and error handling

Cons:

slow performance compared with stored procedures

I would prefer to do the ETL in application code, but the performance was horrible compared with stored procedures. In one test I updated 10,000 rows: the stored procedure took only 1 second, while my ETL code took 70 seconds. Even if I somehow manage to reduce the overhead, 20% of those 70 seconds is spent purely on calling the update statements from application code.

Could someone offer suggestions or comments on how to speed up the ETL process in application code?

My next idea is to parallelize the ETL process by opening multiple database connections and performing the updates and inserts concurrently.

Thanks


asked Dec 02 '10 07:12

dsum

2 Answers

You say you have code generation that automatically generates new classes. Why not have code generation that automatically generates new stored procedures as well?

That should give you the best of both worlds: encapsulate it in a few classes that inspect the database and update the procedures as necessary. That will not make the dynamic SQL more readable, but it will hide it, since you would no longer need to update the SPs manually.
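As a rough illustration of the idea, the sketch below builds the T-SQL text of a synchronization procedure from table metadata. Everything in it is an assumption made for the example: the class and method names, the `Sync_` naming scheme, the `dbo` schema, and the single-column key. The names are expected to come from trusted schema metadata, not from user input, and a PL/SQL counterpart would need its own template.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SyncProcGenerator
{
    // Hypothetical generator: returns the T-SQL for a procedure that copies
    // column values from sourceDb into targetDb for rows with matching keys.
    public static string BuildSqlServerSyncProc(
        string sourceDb, string targetDb, string table,
        string keyColumn, IEnumerable<string> columns)
    {
        // Every non-key column becomes one "t.[Col] = s.[Col]" assignment.
        string setList = string.Join(", ",
            columns.Where(c => c != keyColumn)
                   .Select(c => $"t.[{c}] = s.[{c}]"));

        return
$@"CREATE PROCEDURE dbo.[Sync_{table}] AS
BEGIN
    UPDATE t SET {setList}
    FROM [{targetDb}].dbo.[{table}] AS t
    JOIN [{sourceDb}].dbo.[{table}] AS s ON s.[{keyColumn}] = t.[{keyColumn}];
END";
    }
}
```

Because the database names are baked into the generated text, the procedure itself no longer needs dynamic SQL; regenerating it when the schema or database names change takes the place of building strings at run time.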

Also, the difference should not be that large. It sounds as if something is not being done right: are you reusing connections, are you moving data unnecessarily from the server to the application, or are you processing data in very small batches (row by row)?
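To make the connection-reuse point concrete, here is a minimal sketch that uses only the provider-neutral `System.Data.Common` classes, so the same code can run against either database. The `Customer` table, its `Id` and `Name` columns, and the `BatchedUpdater` class are hypothetical. It assumes the caller passes an already open connection and the parameter prefix of the provider in use.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.Common;

public static class BatchedUpdater
{
    // Hypothetical example: table "Customer" with columns "Id" and "Name".
    // parameterPrefix is "@" for SQL Server and ":" for Oracle.
    public static int UpdateNames(
        DbConnection openConnection,
        IEnumerable<KeyValuePair<int, string>> namesById,
        string parameterPrefix)
    {
        int affected = 0;

        // One transaction for the whole batch instead of a commit per row.
        using (DbTransaction tx = openConnection.BeginTransaction())
        using (DbCommand cmd = openConnection.CreateCommand())
        {
            cmd.Transaction = tx;
            cmd.CommandText =
                "UPDATE Customer SET Name = " + parameterPrefix + "name" +
                " WHERE Id = " + parameterPrefix + "id";

            // Parameters are created once, in the order they appear in the
            // SQL text; only their values change inside the loop.
            DbParameter name = cmd.CreateParameter();
            name.ParameterName = parameterPrefix + "name";
            name.DbType = DbType.String;
            cmd.Parameters.Add(name);

            DbParameter id = cmd.CreateParameter();
            id.ParameterName = parameterPrefix + "id";
            id.DbType = DbType.Int32;
            cmd.Parameters.Add(id);

            foreach (KeyValuePair<int, string> row in namesById)
            {
                id.Value = row.Key;
                name.Value = (object)row.Value ?? DBNull.Value;
                affected += cmd.ExecuteNonQuery();
            }

            // Leaving the using block without Commit rolls the batch back.
            tx.Commit();
        }
        return affected;
    }
}
```

This still makes one round trip per row. If that remains the bottleneck, the usual next step is to send the rows in bulk (for example `SqlBulkCopy` on SQL Server or array binding in ODP.NET) into a staging table and apply a single set-based `UPDATE`, which is closer to what the stored procedure does.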

Also, regarding better logging: care to elaborate on that? You can log at the database layer too, or you can design your SPs so that the application layer can still do the logging.


answered Sep 28 '22 06:09

Unreason

If your C# code is already slow with 10,000 rows, I cannot imagine it in a real environment...

Most ETL is done within the database, as stored procedures, packages, or even code compiled inside the database (PL/SQL, or Java for Oracle). These can handle millions of rows.

Alternatively, professional tools (Informatica and others) can be used; they will still be slower than stored procedures, but easier to manage.

So my conclusion is: if you want to come anywhere close to stored procedure performance, you will have to write an application as good as the professional ones on the market, which took years to develop and mature... Do you think you can?

Plus, if you have to handle different database types (SQL Server, Oracle), you CANNOT make an application generic AND optimize it at the same time; it's a choice, because Oracle does not work the same way SQL Server does.

To give you an idea, ETLs for Oracle use hints (such as the parallel execution hints), and some indexes may be dropped or integrity constraints temporarily disabled to speed up the load.

There is no way that I know of to do the exact same thing in SQL Server (it might have similar options, but with different syntax). So "one ETL for all databases" can hardly be done without losing efficiency and speed.
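The "similar options, different syntax" point can be sketched with index maintenance around a bulk load. Both databases can take an index out of service and rebuild it afterwards, but the statements differ, so generic ETL code ends up branching per dialect. The `Dialect` enum and `EtlSql` class are hypothetical names for this example, and the table and index names are assumed to come from trusted metadata.

```csharp
using System;

public enum Dialect { SqlServer, Oracle }

public static class EtlSql
{
    // Statement that takes an index out of service before a bulk load.
    // On SQL Server this is meant for nonclustered indexes only: disabling
    // a clustered index makes the table itself inaccessible.
    public static string DisableIndex(Dialect dialect, string table, string index)
    {
        switch (dialect)
        {
            case Dialect.SqlServer:
                return $"ALTER INDEX {index} ON {table} DISABLE";
            case Dialect.Oracle:
                // Oracle identifies the index without naming its table.
                return $"ALTER INDEX {index} UNUSABLE";
            default:
                throw new ArgumentOutOfRangeException(nameof(dialect));
        }
    }

    // Statement that brings the index back after the load has finished.
    public static string RebuildIndex(Dialect dialect, string table, string index)
    {
        switch (dialect)
        {
            case Dialect.SqlServer:
                return $"ALTER INDEX {index} ON {table} REBUILD";
            case Dialect.Oracle:
                return $"ALTER INDEX {index} REBUILD";
            default:
                throw new ArgumentOutOfRangeException(nameof(dialect));
        }
    }
}
```

Query hints follow the same pattern: Oracle takes a `/*+ PARALLEL(...) */` comment hint inside the statement, while SQL Server limits or allows parallelism with a trailing `OPTION (MAXDOP n)` clause, so each optimization needs its own code path per database.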

So I think your pros and cons are very accurate: you have to choose between speed and ease of development; you cannot have both.

answered Sep 28 '22 06:09

guigui42
