Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the point of a table in a data lake? [closed]

I thought the whole point of using a Data Lake versus a Data Warehouse was to invert the ETL (Extract, Transform, Load) process to LET (Load, Extract, Transform). Doesn't extracting this data, transforming and loading it into a table get us right back where we started?

like image 348
Chris B. Behrens Avatar asked Oct 16 '17 19:10

Chris B. Behrens


Video Answer


2 Answers

IMHO the point of a data lake is to store all types of data: unstructured, semi-structured and structured. The Azure version of this is Azure Data Lake Store (ADLS) and its primary function is scalable, high-volume storage.

Separately, there is a product Azure Data Lake Analytics (ADLA). This analytics product can interact with ADLS, but also blob storage, SQL on a VM (IaaS) and the two PaaS database offerings, SQL Database and SQL Data Warehouse and HDInsight. It has a powerful batch language called U-SQL, a combination of SQL and .net for interrogating and manipulating these data stores. It also has a database option which, where appropriate, allows you to store data you have processed in table format.

One example might be where you have some unstructured data in your lake, you run your batch output and want to store the structured intermediate output. This is where you might store the output in an ADLA database table. I tend to use them where I can prove I can get a performance improvement out of them and/or want to take advantage of the different indexing options.

I do not tend to think of these as warehouse tables because they don't interact well with other products yet, ie they don't as yet have endpoints / aren't visible, eg Azure Data Factory can't move tables from there yet.

Finally I tend to think of ADLS as analogous to HDFS and U-SQL/ADLA as analogous to Spark.

HTH

like image 158
wBob Avatar answered Sep 26 '22 02:09

wBob


By definition a data lake is a huge repository storing raw data in it's native format until needed. Lakes use a flat architecture rather than nested (http://searchaws.techtarget.com/definition/data-lake). Data in the lake has a unique ID and metadata tags, which are used in queries.

So data lakes can store structured, semi-structured and unstructured data. Structured data would include SQL database type data in tables with rows and columns. Semi-structured would be CSV files and the like. And unstructured data is anything and everything -- emails, PDFs, video, binary. It's that ID and the metadata tags that help users find data inside the lake.

To keep a data lake manageable, successful implementers rotate, archive or purge data from the lake on a regular basis. Otherwise it becomes what some have called a "data swamp", basically a graveyard of data.

The traditional ELT process is better suited to data warehouses because they are more structured and data in a warehouse is there for a purpose. Data lakes, being less structured, are more suited to other approaches such as ELT (Extract, Load, Transform), because they store raw data that is only categorized by each query. (See this article by Panopoly for a discussion of ELT vs ETL.) For example, you want to see customer data from 2010. When you query a data lake for that you will get everything from accounting data, CRM records and even emails from 2010. You cannot analyze that data until it has been transformed into usable formats where the common denominators are customers + 2010.

like image 36
Kobi Lehrer Avatar answered Sep 23 '22 02:09

Kobi Lehrer