Use Data Lake or Blob on an HDInsight cluster on Azure

When creating an HDInsight Hadoop cluster in Azure, there are two storage options: Azure Data Lake Store (ADLS) or Azure Blob Storage.

What are the real differences between these two options and how do they affect the performance?

I found this page: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-comparison-with-blob-storage. But it is not very specific and only uses very general terms like "ADLS is optimized for analytics".

Does it mean that it's better for storing the HDInsight file system? And if ADLS is indeed faster, then why not use it for non-analytics data as well?

viblo asked Nov 28 '17 10:11


2 Answers

As per this document, an Azure Storage account can hold up to 4.75 TB, though individual blobs (or files from an HDInsight perspective) can only go up to 195 GB. Azure Data Lake Store can grow dynamically to hold trillions of files, with individual files greater than a petabyte. For more information, see Understanding blobs and Data Lake Store.

Also, check Benefits of Azure Storage and Use Data Lake Store for more details and comparisons.

Hope this helps.

AshokPeddakotla-MSFT answered Oct 13 '22 08:10


In addition to Ashok's answer: ADLS is currently available in only a few regions, whereas Azure Storage is available in many more. So if you need your HDInsight cluster in a specific region, you should make sure your storage is available in that same region.

Another benefit of ADLS over Azure Storage is its POSIX-based security model at the file/folder level that uses AAD security principals instead of Shared Access Keys.
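
To illustrate what a POSIX-style model at the file/folder level means, here is a minimal Python sketch of the classic owner/group/other permission check. It is not the ADLS API: the `Entry` class, the `is_allowed` function, and the principal names are hypothetical, and the plain strings only stand in for AAD security principals.

```python
from dataclasses import dataclass

# Classic POSIX permission bits.
READ, WRITE, EXECUTE = 4, 2, 1


@dataclass(frozen=True)
class Entry:
    """A file or folder with an owner, an owning group, and a mode (hypothetical type)."""
    owner: str   # stands in for an AAD user or service principal
    group: str   # stands in for an AAD security group
    mode: int    # octal mode such as 0o750 (rwxr-x---)


def is_allowed(entry: Entry, principal: str, groups: set, wanted: int) -> bool:
    """Return True if `principal` holds all `wanted` permission bits on `entry`."""
    if principal == entry.owner:
        bits = (entry.mode >> 6) & 7   # owner triad
    elif entry.group in groups:
        bits = (entry.mode >> 3) & 7   # group triad
    else:
        bits = entry.mode & 7          # everyone else
    return (bits & wanted) == wanted


folder = Entry(owner="alice", group="analysts", mode=0o750)
print(is_allowed(folder, "alice", set(), WRITE))        # owner has rwx
print(is_allowed(folder, "bob", {"analysts"}, READ))    # group has r-x
print(is_allowed(folder, "bob", {"analysts"}, WRITE))   # group lacks w
print(is_allowed(folder, "carol", set(), READ))         # others have no access
```

The point of the comparison is granularity: each file or folder carries its own permissions tied to an identity, whereas a shared key grants access without identifying an individual principal.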

The main reason you may not want to use ADLS for non-analytics data is cost. Because of its additional capabilities, it is currently somewhat more expensive.

Michael Rys answered Oct 13 '22 09:10