I was going through the Microsoft documentation:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
I'm new to Azure Data Lake and HDInsight. The page linked above states that
"Azure Data Lake Store can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs."
As per my initial understanding, Data Lake Store is a store in which any kind of data can be kept. I think HDInsight does more or less the same thing.
My question is: what is the difference between Azure Data Lake and Azure HDInsight? If HDInsight can be used for file storage or any other kind of storage, why use Data Lake? It would be great if someone could clarify this in detail. Thanks.
Azure Data Lake Analytics provides serverless compute and uses Azure Data Lake Store for data storage, whereas with HDInsight we need to specify and size the compute virtual machine nodes according to our processing requirements.
Azure Blob Storage is a general-purpose, scalable object store that is designed for a wide variety of storage scenarios. Azure Data Lake Storage Gen1 is a hyper-scale repository that is optimized for big data analytics workloads. Blob Storage authentication is based on shared secrets: Account Access Keys and Shared Access Signature Keys.
HDInsight has been around for a number of years. Synapse can be "paused", is consumption-based, and has a much gentler learning curve. Synapse incorporates many other Azure services and is becoming a one-stop hub for analytics and data orchestration.
Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages.
Azure HDInsight is a cloud service that allows cost-effective data processing using open-source frameworks such as Hadoop, Spark, Hive, Storm, and Kafka, among others.
Data Lake Storage Gen2 is available as a storage option for almost all Azure HDInsight cluster types as both a default and an additional storage account. HBase, however, can have only one account with Data Lake Storage Gen2.
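As a rough illustration of how a cluster would point at Gen2 storage, the sketch below assembles an `abfs://` / `abfss://` path of the kind the Hadoop ABFS driver expects. The container name, account name, and the helper `gen2_uri` are made up for this example; only the general shape of the URI is meant to carry over.

```python
def gen2_uri(container: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build a Data Lake Storage Gen2 URI for the Hadoop ABFS driver.

    `container` and `account` are hypothetical placeholders; substitute the
    names from your own storage account.
    """
    # "abfss" is the TLS variant of the scheme, "abfs" the plain one.
    scheme = "abfss" if secure else "abfs"
    # Strip a leading slash so the result has exactly one after the host.
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical container and account names, for illustration only.
print(gen2_uri("mycontainer", "mystorageacct", "/data/events.csv"))
```

A path built this way is what you would pass to Hive, Spark, or `hdfs dfs` commands on the cluster, assuming the cluster has been granted access to that account.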
Azure Data Lake is an on-demand, scalable, cloud-based storage and analytics service. It can be divided into two connected services, Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). ADLS is a cloud-based file system which allows the storage of any type of data with any structure,...
In a nutshell:
HDInsight is a managed Hadoop service (it provides the compute).
Azure Data Lake (ADL) is a managed storage service (it provides large-scale storage).
(Instead of ADL, you can choose to use Blobs with HDInsight, but Blobs have some limitations; for example, streaming files to storage via the HDInsight cluster is not supported.)
Here is the definition from Azure documentation (below):
Azure uses a "decomposed hardware" approach, meaning that compute and storage are provisioned as separate services.
You can think of HDInsight as a Hadoop cluster and of Azure Data Lake (ADL) as HDFS, except that the two are detached from each other.
To relate this to AWS: HDInsight is equivalent to EMR, and ADL is equivalent to EMRFS or S3.
If you terminate the cluster, the ADL storage and the files in it remain. You can access the storage directly from another service or tool (such as Azure Databricks), or you can create another HDInsight cluster on top of the same data.
HDInsight accesses ADL through the adl:// scheme. It never stores file blocks on the cluster nodes (as a classic Hadoop cluster does); instead, it maps paths to the storage service.
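To make the adl:// point and the "WebHDFS-compatible REST API" statement from the question a bit more concrete, here is a minimal sketch that only builds the two addresses involved for a Data Lake Store (Gen1) account. The account name `mystore`, the path, and both helper functions are assumptions for illustration; nothing here contacts Azure, and a real REST call would additionally need an Azure AD OAuth bearer token.

```python
from urllib.parse import quote, urlencode

# Host suffix used by Data Lake Store (Gen1) accounts.
ADLS_GEN1_SUFFIX = "azuredatalakestore.net"


def adl_uri(account: str, path: str) -> str:
    """URI form that Hadoop tools on the cluster use (adl:// scheme)."""
    return f"adl://{account}.{ADLS_GEN1_SUFFIX}/{path.lstrip('/')}"


def webhdfs_url(account: str, path: str, op: str = "LISTSTATUS") -> str:
    """HTTPS form of the same location via the WebHDFS-compatible REST API."""
    query = urlencode({"op": op})
    # quote() percent-encodes unsafe characters but leaves "/" intact.
    encoded_path = quote(path.lstrip("/"))
    return f"https://{account}.{ADLS_GEN1_SUFFIX}/webhdfs/v1/{encoded_path}?{query}"


# "mystore" and the path are hypothetical; use your own account and folder.
print(adl_uri("mystore", "/clusters/logs"))
print(webhdfs_url("mystore", "/clusters/logs"))
```

Both addresses refer to the same files, which is why the data outlives any single cluster: the cluster only holds the path mapping, and the bytes live in the storage service.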