I have recently been exposed to some ETL tools, such as Talend and Apatar, and I was wondering what exactly the purpose/main goal of these tools is, in layman's terms. Who primarily uses them, and if you use them, how are they better than just writing some kind of script?
ETL stands for "Extract/Transform/Load". These tools take data from one source and move it into another. You can map the source schema onto the destination schema, transform and cleanse the data before it reaches the destination, and load the destination efficiently. You can schedule ETL jobs as batch processes.
Those data sources can be relational databases, spreadsheets, XML files, etc.
Who "uses" them? Depends on what you mean by "uses". They're just code, and most of the time they run as scheduled parts of regular operations. There are no end-user features; they're entirely for programmers to create and for operations to run.
Advantage over scripts? None. They are scripts, written in a domain-specific language (DSL) focused entirely on "extract" from a source, "transform", and "load" to a destination. Most of the interesting part of such a script is the field-by-field mapping at each stage.
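To make that concrete, here is a minimal sketch in plain Python of the kind of field-by-field mapping such a DSL expresses; all the field names and cleansing rules below are hypothetical:

```python
# A hypothetical field-by-field mapping: the heart of most ETL scripts.
def transform(source_row: dict) -> dict:
    """Map one source record onto the destination schema."""
    return {
        "customer_id":  int(source_row["CUST_NO"]),           # rename + cast
        "full_name":    source_row["NAME"].strip().title(),   # cleanse
        "country_code": source_row.get("COUNTRY", "US"),      # default value
    }

print(transform({"CUST_NO": "042", "NAME": " ada lovelace "}))
# {'customer_id': 42, 'full_name': 'Ada Lovelace', 'country_code': 'US'}
```

An ETL tool wraps this same idea in a visual or declarative form and adds the scheduling, logging, and error handling around it.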
ETL is simply: get data, do something to it, stick it somewhere.
Extraction - get data from somewhere (the location is called the Source). The Source can be files, database tables, a pipe, etc.
Transformation - do something to the source data. E.g., your Source contains a person's name and birthdate, so you can "transform" the data to determine the person's age.
Load - Take the transformed data and put it somewhere, such as a different database or the same database. This is called the Target. Almost any type of data store that can be a Source can also be used as a Target.
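Putting those three steps together, here is a minimal end-to-end sketch using only Python's standard library. It implements the birthdate-to-age transform from above; the file and table names (people.csv, target.db, ages) are hypothetical:

```python
import csv
import sqlite3
from datetime import date

# Create a tiny sample Source file so the sketch runs end to end.
with open("people.csv", "w", newline="") as f:
    f.write("person,birthdate\nAda,1815-12-10\n")

# Extract: read rows from the Source (a CSV file here).
def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

# Transform: derive each person's age from their birthdate.
def transform(row):
    born = date.fromisoformat(row["birthdate"])
    today = date.today()
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return (row["person"], age)

# Load: put the transformed data into the Target (a SQLite table here).
con = sqlite3.connect("target.db")
con.execute("CREATE TABLE IF NOT EXISTS ages (person TEXT, age INTEGER)")
con.executemany("INSERT INTO ages VALUES (?, ?)",
                (transform(r) for r in extract("people.csv")))
con.commit()
con.close()
```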
Some examples:
Extract data from a data warehouse in DB2. Transform the data using a filter to only send along data that matches a certain date range. Load the filtered data into a SQL Server data mart.
Extract data from XML files. Transform the data so that it is "flattened" into rows. Load the data into a comma-delimited file (see the sketch after these examples).
Extract data that is coming in on a pipe or a queue, such as IBM MQ. Transform the data by parsing out key fields in the queued message. Load the parsed data into a mainframe.
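Here is a rough sketch of the second example (flattening XML into a comma-delimited file) in plain Python; the XML layout and tag names are made up for illustration:

```python
import csv
import xml.etree.ElementTree as ET

xml_doc = """
<orders>
  <order id="1"><customer>Ada</customer><total>19.99</total></order>
  <order id="2"><customer>Grace</customer><total>42.50</total></order>
</orders>
"""

# Extract: parse the XML Source.
root = ET.fromstring(xml_doc)

# Transform: flatten each nested <order> element into one flat row.
rows = [
    (order.get("id"), order.findtext("customer"), order.findtext("total"))
    for order in root.iter("order")
]

# Load: write the rows to a comma-delimited Target file.
with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "customer", "total"])
    writer.writerows(rows)
```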
Some tools:
Big bucks, enterprise class: Ab Initio, Informatica, IBM DataStage
Less pricey: Jitterbit, Talend
Somewhere in the middle: SSIS for SQL Server
Who uses them? Primarily companies that need to build and maintain complex data warehouses will invest in an ETL tool. ETL tools often provide a visual designer, letting you build an ETL program by wiring components together rather than through code alone.
ETL is commonly used in data warehousing applications.
For example, you might have an Oracle or SQL Server order-processing system. It might keep all the data until the order is shipped, but you wouldn't want years' worth of old orders clogging up the system.
Additionally, you might have several systems like this in your company, all developed independently of each other.
So, to consolidate the historical data, you might set up a data warehouse where the data from all of these disparate systems ends up, giving you one nice place to do reporting, planning, data mining, etc.
Since all the data sources are different, and the kinds of data you want to store long-term may differ from the data held in the smaller operational databases, you set up an ETL system to convert the data and manage the flow.
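As a rough illustration of that flow, here is a sketch that archives old shipped orders from an order-processing database into a warehouse table. Both databases are SQLite here just for illustration, and all table and column names are hypothetical:

```python
import sqlite3
from datetime import date, timedelta

oltp = sqlite3.connect("orders.db")          # a live order-processing system
warehouse = sqlite3.connect("warehouse.db")  # the consolidated reporting store

# Schemas created inline so the sketch is self-contained.
oltp.execute("CREATE TABLE IF NOT EXISTS orders "
             "(id INTEGER, shipped_on TEXT, total REAL)")
warehouse.execute("CREATE TABLE IF NOT EXISTS order_history "
                  "(source_system TEXT, order_id INTEGER, shipped_on TEXT, total REAL)")

# Extract: only shipped orders older than the retention cutoff.
cutoff = (date.today() - timedelta(days=365)).isoformat()
old_orders = oltp.execute(
    "SELECT id, shipped_on, total FROM orders WHERE shipped_on < ?", (cutoff,)
)

# Transform: tag each row with its source system, so several independently
# developed order systems can land in the same warehouse table.
rows = (("orders_east", oid, shipped, total) for oid, shipped, total in old_orders)

# Load the warehouse; a real job would then purge the archived rows
# from the source system.
warehouse.executemany("INSERT INTO order_history VALUES (?, ?, ?, ?)", rows)
warehouse.commit()
```

In practice, each source system would get its own extract-and-transform step feeding the same warehouse schema, which is exactly the mapping work an ETL tool is built for.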