Reasons to use Azure Data Lake Analytics vs Traditional ETL approach

Tags: azure, azure-data-lake, u-sql

I'm considering using Data Lake technologies, which I have been studying for the last few weeks, instead of the traditional SSIS ETL scenarios I have worked with for many years.

I think of Data Lake as something closely linked to big data, but where is the line between using Data Lake technologies and SSIS?

Is there any advantage to using Data Lake technologies with 25MB ~ 100MB ~ 300MB files? Parallelism? Flexibility? Extensibility in the future? Is there any performance gain when the files to be loaded are not as big as in U-SQL's best-case scenario?

What are your thoughts? Would it be like using a hammer to crack a nut? Please, don't hesitate to ask me any questions to clarify the situation. Thanks in advance!!

21/03 EDIT More clarifications:

1. It has to be in the cloud.
2. The reason I considered ADL is that there is no substitute for SSIS in the cloud. There is ADF, but it's not the same: it orchestrates the data, but it's not as flexible as SSIS.
3. I thought I could use U-SQL for some (basic) transformations, but I see some problems:
   - There are many basic things I cannot do: loops, updates, writing logs in a SQL...
   - The output can only be a U-SQL table or a file. The architecture doesn't look good this way (even though U-SQL is very good with big files) if I need an extra step to export the file to another DB or DWH. Or maybe this is the way it's done in big data warehouses... I don't know.
   - In my tests, it takes 40s for a 1MB file and 1:15 for a 500MB file. I cannot justify a 40s process for 1MB (plus uploading to the database/data warehouse with ADF).
   - The code looks unorganised to a user, as scripts with many basic validations become very long U-SQL scripts.

Don't get me wrong, I really like ADL technologies, but I think that, for now, they are for something very specific and there is still no substitute for SSIS in the cloud. What do you think? Am I wrong?

asked Mar 17 '17 08:03

Carlos Moreno

3 Answers

For me, if the data is highly structured and relational, the right place for it is a relational database. In Azure you have several choices:

1. SQL Server on a VM (IaaS): ordinary SQL Server running on a VM. You have to install, configure and manage it yourself, but you get the full flexibility of the product.
2. Azure SQL Database: PaaS database option targeted at smaller volumes, but now up to 4TB. All of the features of normal SQL Server, with potentially lower TCO and the option to scale up or down using tiers.
3. Azure SQL Data Warehouse (ADW): MPP product suitable for large warehouses. For me, the entry criterion is a warehouse of at least 1TB in size, and probably more like 10TB. It's really not worth having an MPP for small volumes.

For all database options you can use clustered columnstore indexes (the default in ADW), which can give massive compression, between 5x and 10x.

400MB per day for a year totals ~143GB, which honestly is not that much in modern data warehouse terms, which are normally measured in terabytes (TB).
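
The arithmetic behind that figure can be sketched as follows. The 400MB per day volume and the 5x to 10x compression range come from the text above; the function name and the use of binary units (1GB = 1024MB) are assumptions made for this illustration.

```python
def yearly_volume_gb(mb_per_day: float, days: int = 365) -> float:
    """Return the raw volume accumulated over `days`, in GB (assuming 1 GB = 1024 MB)."""
    return mb_per_day * days / 1024


raw_gb = yearly_volume_gb(400)

# Apply the 5x to 10x columnstore compression range quoted above.
compressed_high_gb = raw_gb / 5   # least compression, largest result
compressed_low_gb = raw_gb / 10   # most compression, smallest result

print(f"raw: {raw_gb:.1f} GB per year")
print(f"compressed: {compressed_low_gb:.1f} to {compressed_high_gb:.1f} GB per year")
```

Either way, a year of data stays far below the 1TB mark mentioned above as the entry point for an MPP warehouse.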

Where Azure Data Lake Analytics (ADLA) comes in is doing things you cannot do in ordinary SQL, like:

- combining the power of C# with SQL for powerful queries
- dealing with unstructured files like images, XML or JSON
- using RegEx
- scaling out R processing

ADLA also offers federated queries, the ability to "query data where it lives", i.e. to bring together structured data from your database and unstructured data from your lake.

Your decision seems more to do with whether or not you should be using the cloud. If you need the elastic and scalable features of cloud then Azure Data Factory is the tool for moving data from place to place in the cloud.

HTH

answered Nov 01 '22 18:11

wBob

Be careful. This question is likely to get closed for being too broad.

There are many arguments for and against. We can't discuss them all here.

ADL isn't a replacement for SSIS. The consultant's answer, as always, will be: it depends on what you're doing or trying to do.

A simplistic answer might be: ADL is unlimited and highly scalable; SSIS is not. But yes, ADL has a high entry point for small files because of that scalability.
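
That entry point is visible in the timings quoted in the question (40s for a 1MB file, 1:15 for a 500MB file). Here is a rough sketch, assuming 1:15 means 75 seconds and that job time grows linearly with file size. Both are assumptions, the function name is made up, and two data points are far too few to fit anything reliable.

```python
def fit_linear(p1: tuple[float, float], p2: tuple[float, float]) -> tuple[float, float]:
    """Fit t = overhead + per_mb * size through two (size_mb, seconds) points."""
    (s1, t1), (s2, t2) = p1, p2
    per_mb = (t2 - t1) / (s2 - s1)
    overhead = t1 - per_mb * s1
    return overhead, per_mb


# Timings from the question; 1:15 is assumed to mean 75 seconds.
overhead_s, seconds_per_mb = fit_linear((1, 40), (500, 75))

print(f"fixed overhead: about {overhead_s:.1f} s per job")
print(f"marginal cost: about {seconds_per_mb:.3f} s per MB")
```

Under those assumptions, nearly all of the 40 seconds for the small file is startup cost rather than data processing, which is why small files gain little from the scale-out.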

Generally I don't think the two technologies are comparable.

If you want SSIS in Azure, wait for MS to release it as a PaaS, or use a virtual machine.

answered Nov 01 '22 18:11

Paul Andrew

I think for simpler transformations it may be a good solution; however, if you have complexities, notifications, etc., it may not be a good fit. A typical scenario would be something like transforming a JSON document to CSV, then taking the CSV and running it through SSIS for further transforms. There is certainly a future state in which U-SQL will be much more powerful; for now, I think there are separate and distinct uses for U-SQL/ADLA/ADLS and SSIS.
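
To illustrate the first step of that scenario, here is a minimal sketch of flattening a JSON document into CSV. It is written in Python with only the standard library rather than U-SQL, and the sample records, field names and function name are made up for the example.

```python
import csv
import io
import json


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text with a header row."""
    records = json.loads(json_text)
    # Collect column names in first-seen order so records with missing keys still fit.
    fieldnames: list[str] = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample input, not taken from the original post.
sample = '[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta", "size_mb": 25}]'
csv_text = json_to_csv(sample)
print(csv_text)
```

The resulting CSV could then be picked up by an SSIS flat-file source for the further transforms mentioned above.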

answered Nov 01 '22 18:11

Carolus Holman
