ETL seems to be a pretty common task. I am reading about some mistakes that ETL designers make with very large data at http://it.toolbox.com/blogs/infosphere/17-mistakes-that-etl-designers-make-with-very-large-data-19264
I need some practical insights on the following points:
a) Incorporating inserts, updates, and deletes into the same data flow / same process. How is that a problem?
b) Sourcing multiple systems at the same time, depending on heterogeneous systems of data.
c) Not producing the correct indexes on the sources/lookups that need to be accessed.
d) Believing that 'I need to process all the data in one pass because it's the fastest way to do it'.
Any help?
a) Data integrity issues.
b) Data quality improves, and there are fewer failures, when the work is done in smaller chunks.
c) It will take more time to complete.
d) Wrong indexes cost more time. It is better to create indexes based on the queries you are executing, i.e. the columns that appear in the WHERE clause of the statement.
e) Splitting the data into smaller data sets and processing those would be an efficient solution.
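To make points d) and e) a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption made up for illustration: the tables `src` and `tgt`, the column names, the index name `idx_src_customer` and the chunk size are not from the article. It creates an index on the column the lookup filters on, then copies rows to the target in small batches keyed on the primary key, committing after each batch instead of doing everything in one pass.

```python
import sqlite3


def build_demo_db(n_rows=1000):
    """Create an in-memory database with a hypothetical source and target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE src (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE tgt (id INTEGER PRIMARY KEY, customer_id INTEGER, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO src (id, customer_id, amount) VALUES (?, ?, ?)",
        [(i, i % 50, i * 0.5) for i in range(1, n_rows + 1)],
    )
    # Point d): index the column that appears in the WHERE clause of the lookup.
    conn.execute("CREATE INDEX idx_src_customer ON src (customer_id)")
    conn.commit()
    return conn


def lookup_plan(conn):
    """Return SQLite's query plan text for the lookup, to check the index is used."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM src WHERE customer_id = ?", (7,)
    ).fetchall()
    # The last column of each plan row is the human-readable detail string.
    return " ".join(str(row[-1]) for row in rows)


def load_in_chunks(conn, chunk_size=200):
    """Point e): copy src to tgt in batches, committing after each batch."""
    last_id = 0
    chunks = 0
    while True:
        batch = conn.execute(
            "SELECT id, customer_id, amount FROM src "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not batch:
            break
        conn.executemany(
            "INSERT INTO tgt (id, customer_id, amount_cents) VALUES (?, ?, ?)",
            [(i, c, int(round(a * 100))) for i, c, a in batch],
        )
        conn.commit()  # a failure later only affects the batches not yet committed
        last_id = batch[-1][0]  # resume point for the next batch
        chunks += 1
    return chunks


if __name__ == "__main__":
    conn = build_demo_db()
    print(lookup_plan(conn))
    print("chunks processed:", load_in_chunks(conn))
```

Keying each batch on `id > last_id` rather than on an OFFSET means every batch is a short range scan on the primary key, and `last_id` doubles as a restart point if a run fails partway through. On a real system the right chunk size and commit frequency depend on the database and the load, so treat the numbers here as placeholders.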
You're a BITS-PILANI (WILP) student, right?