
How to approach an ETL mission?

I am supposed to perform ETL where the sources are a large, badly designed SQL Server 2000 database and a better-designed SQL Server 2005 database. I think SSIS is the way to go. Can anyone suggest a to-do list or checklist of things to watch out for so that I don't forget anything? How should I approach this so that it does not bite me in the rear later on?

Asked Jan 13 '09 by Perpetualcoder

1 Answer

Some general ETL tips

  1. Consider organising it by destination (for example, all the code to produce the Customer dimension lives in the same module, regardless of source). This is sometimes known as Subject-oriented ETL. It makes finding stuff much easier and will increase the maintainability of your code.

  2. If the SQL2000 database is a mess, you will probably find that SSIS data flows are a clumsy way to deal with the data. As a rule, ETL tools scale poorly with complexity; something like half of all data warehouse projects in finance companies are done with stored procedure code as an explicit architectural decision - for precisely this reason. If you have to put a large amount of code in sprocs, consider putting all of the code in sprocs.

    For a system involving lots of complex scrubbing or transformations, a 100% sproc approach is far more maintainable as it is the only feasible way to put all of the transformations and business logic in one place. With mixed ETL/sproc systems, you have to look in multiple places to track, troubleshoot, debug or change the whole transformation.

  3. The sweet spot for ETL tools is systems with a larger number of data sources and relatively simple transformations.

  4. Make the code testable, so you can pick apart the components and test in isolation. Code that can only be executed from within the middle of a complex data flow in an ETL tool is much harder to test.

  5. Make the data extract dumb with no business logic, and copy into a staging area. If you have business logic spread across the extract and transform layers, you will have transformations that cannot be tested in isolation and make it hard to track down bugs. If the transform is running from a staging area you reduce the hard dependency on the source system, again enhancing testability. This is a particular win on sproc-based architectures as it allows an almost completely homogeneous code base.

  6. Build a generic slowly-changing dimension handler or use one off the shelf if available. This makes it easier to unit test this functionality. If this can be unit tested, the system testing does not have to test all of the corner cases, merely whether the data presented to it is correct. This is not as complex as it sounds - the last one I wrote was about 600 or 700 lines of T-SQL code. The same goes for any generic scrubbing functions.

  7. Load incrementally if possible.

  8. Instrument your code - have it make log entries, possibly recording diagnostics such as check totals or counts. Without this, troubleshooting is next to impossible. Assertion checking is also a good way to think about error handling here (does the row count in A equal the row count in B; is the A:B relationship really 1:1?).

  9. Use synthetic keys. Using natural keys from the source systems ties your system to the data sources and makes it difficult to add extra sources. The keys and relationships in the system should always line up - no nulls. For errors or 'not recorded' values, make specific 'error' or 'not recorded' entries in the dimension table and match to them.

  10. If you build an Operational Data Store (the subject of many a religious war) do not recycle the ODS keys in the star schemas. By all means join on ODS keys to construct dimensions, but match on a natural key. This allows you to arbitrarily drop and recreate the ODS - possibly changing its structure - without disturbing the star schemas. Having this capability is a real maintenance win, as you can change ODS structure or do a brute-force re-deployment of the ODS at any point.

Points 1-2 and 4-5 mean that you can build a system where all of the code for any given subsystem (e.g. a single dimension or fact table) lives in one and only one place in the system. This type of architecture is also better for larger numbers of data sources.
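
As a sketch of the 'dumb extract' in point 5: the extract copies source rows verbatim into staging, with no business logic. All names here (staging.CustomerSrc, SourceDB.dbo.Customer, the columns) are hypothetical.

```sql
-- Minimal 'dumb' extract into staging (hypothetical names throughout).
-- No business logic here: rows are copied as-is, so every transformation
-- downstream can run against staging with no dependency on the live source.
TRUNCATE TABLE staging.CustomerSrc;

INSERT INTO staging.CustomerSrc (CustomerCode, CustomerName, Region, SrcModifiedDate)
SELECT CustomerCode, CustomerName, Region, ModifiedDate
FROM   SourceDB.dbo.Customer;   -- straight copy; all scrubbing happens in the transform step
```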

Point 3 is a counterpoint to point 2. Basically, the choice between SQL and ETL tooling is a function of transformation complexity and the number of source systems. The simpler the data and the larger the number of data sources, the stronger the case for a tools-based approach. The more complex the data, the stronger the case for moving to an architecture based on stored procedures. Generally it's better to exclusively or almost exclusively use one or the other, but not both.

Point 6 makes your system easier to test. Testing SCDs or any change-based functionality is fiddly, as you have to be able to present more than one version of the source data to the system. If you move the change management functionality into infrastructure code, you can test it in isolation with test data sets. This is a win in testing, as it reduces the complexity of your system testing requirements.
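
A minimal sketch of what the core of such a handler might look like as a Type 2 load, assuming hypothetical staging and dimension tables (staging.CustomerScrubbed, dw.DimCustomer) with ValidFrom/ValidTo/IsCurrent columns:

```sql
-- Type 2 SCD sketch, hypothetical names; assumes the tracked attribute
-- columns are NOT NULL (a production version needs NULL-safe comparisons
-- and would be written generically, not per-dimension).

-- Step 1: close off current rows whose tracked attributes have changed.
UPDATE d
SET    d.ValidTo   = GETDATE(),
       d.IsCurrent = 0
FROM   dw.DimCustomer AS d
JOIN   staging.CustomerScrubbed AS s ON s.CustomerCode = d.CustomerCode
WHERE  d.IsCurrent = 1
  AND (d.CustomerName <> s.CustomerName OR d.Region <> s.Region);

-- Step 2: insert a new current row for changed and brand-new customers.
-- Anything closed in step 1, or never seen before, has no current row left.
INSERT INTO dw.DimCustomer (CustomerCode, CustomerName, Region, ValidFrom, ValidTo, IsCurrent)
SELECT s.CustomerCode, s.CustomerName, s.Region, GETDATE(), NULL, 1
FROM   staging.CustomerScrubbed AS s
LEFT JOIN dw.DimCustomer AS d
       ON d.CustomerCode = s.CustomerCode AND d.IsCurrent = 1
WHERE  d.CustomerCode IS NULL;
```

Because the handler is just two set-based statements over a staging table, it can be exercised in isolation by loading the staging table with successive versions of a test customer and checking the versioned rows that result.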

Point 7 is a general performance tip that you will need to observe for large data volumes. Note that you may only need incremental loading for some parts of a system; for smaller reference tables and dimensions you may not need it.
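
A sketch of the usual high-water-mark pattern for incremental extracts, again with hypothetical names (etl.LoadControl as the watermark table):

```sql
-- Incremental extract sketch (hypothetical names).
DECLARE @HighWater datetime;

SELECT @HighWater = LastLoadedDate
FROM   etl.LoadControl
WHERE  TableName = 'Customer';

INSERT INTO staging.CustomerSrc (CustomerCode, CustomerName, Region, SrcModifiedDate)
SELECT CustomerCode, CustomerName, Region, ModifiedDate
FROM   SourceDB.dbo.Customer
WHERE  ModifiedDate > @HighWater;          -- only rows changed since the last run

-- Advance the mark to the newest row actually extracted (not GETDATE()),
-- so clock skew on the source cannot cause rows to be skipped.
UPDATE etl.LoadControl
SET    LastLoadedDate = ISNULL((SELECT MAX(SrcModifiedDate) FROM staging.CustomerSrc), @HighWater)
WHERE  TableName = 'Customer';
```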

Point 8 is germane to any headless process. If it goes tits up during the night, you want some fighting chance of seeing what went wrong the next day. If the code doesn't properly log what's going on and catch errors, you will have a much harder job troubleshooting it.
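
A sketch of the kind of logging and assertion checking this implies, with hypothetical log and staging tables:

```sql
-- Instrumentation sketch (hypothetical names).
DECLARE @SrcRows int, @Unmatched int;

SELECT @SrcRows = COUNT(*) FROM staging.CustomerScrubbed;

-- Assertion: after the load, every staged customer has a current dimension row.
SELECT @Unmatched = COUNT(*)
FROM   staging.CustomerScrubbed AS s
LEFT JOIN dw.DimCustomer AS d
       ON d.CustomerCode = s.CustomerCode AND d.IsCurrent = 1
WHERE  d.CustomerCode IS NULL;

INSERT INTO etl.LoadLog (StepName, RowsIn, RowsUnmatched, LoggedAt)
VALUES ('Load DimCustomer', @SrcRows, @Unmatched, GETDATE());

IF @Unmatched > 0
    RAISERROR ('DimCustomer load: %d staged rows have no current dimension row', 16, 1, @Unmatched);
```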

Point 9 gives the data warehouse a life of its own. You can easily add and drop source systems when the warehouse has its own keys. Warehouse keys are also necessary to implement slowly changing dimensions.
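
A sketch of a dimension keyed this way, with reserved members for errors and missing data (names hypothetical; the SCD columns from the earlier sketch are omitted for brevity):

```sql
-- Surrogate-keyed dimension sketch (hypothetical names).
CREATE TABLE dw.DimCustomer (
    CustomerKey  int IDENTITY(1,1) PRIMARY KEY,   -- synthetic warehouse key
    CustomerCode varchar(20)  NOT NULL,           -- natural key from the source
    CustomerName varchar(100) NOT NULL
);

-- Reserved members so fact keys never need to be NULL.
SET IDENTITY_INSERT dw.DimCustomer ON;
INSERT INTO dw.DimCustomer (CustomerKey, CustomerCode, CustomerName)
VALUES (-1, 'N/A', 'Not Recorded');
INSERT INTO dw.DimCustomer (CustomerKey, CustomerCode, CustomerName)
VALUES (-2, 'ERR', 'Error');
SET IDENTITY_INSERT dw.DimCustomer OFF;

-- Fact load: fall back to the 'Not Recorded' member instead of a NULL key.
SELECT COALESCE(d.CustomerKey, -1) AS CustomerKey, f.SaleAmount
FROM   staging.SalesSrc AS f
LEFT JOIN dw.DimCustomer AS d ON d.CustomerCode = f.CustomerCode;
```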

Point 10 is a maintenance and deployment win, as the ODS can be re-structured if you need to add new systems or change the cardinality of a record. It also means that a dimension can be loaded from more than one place in the ODS (think: adding manual accounting adjustments) without a dependency on the ODS keys.
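
A sketch of the separation point 10 describes: joins on ODS keys stay inside the ODS-side query, while the star schema row is located by natural key (names hypothetical):

```sql
-- ODS keys are used only to assemble the source-side query; the star schema
-- is matched on the natural key, so the ODS can be dropped and rebuilt
-- (with new keys) without disturbing dw.DimCustomer.
UPDATE d
SET    d.CustomerName = o.CustomerName,
       d.Region       = r.RegionName
FROM   dw.DimCustomer AS d
JOIN   ods.Customer AS o ON o.CustomerCode = d.CustomerCode   -- natural key match
JOIN   ods.Region   AS r ON r.RegionId     = o.RegionId;      -- ODS key join, internal only
```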

Answered by ConcernedOfTunbridgeWells