I read Microsoft's documentation on the hierarchical namespace (https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace), but I was unable to understand it clearly.
Can anyone please help me understand it in layman's terms / simple language?
How does this feature separate ADLS from Azure Blob storage?
The summary, for now, is that Hierarchical Namespace makes Azure Storage behave more like an ADLS Gen1 style store in practice, at the cost of losing some Azure Blob Storage functionality.
Hierarchical Namespace gains you:
At the same time, you lose Blob Storage features including:
In practice, you can expect inconsistent compatibility with anything that tries to interact with Azure Storage. It might work 100%, it might refuse to work at all (or not list the Storage Account as an option, if using Azure Portal UI wizards), or it might work partially. Without knowing the underlying implementation, the outcome is difficult to predict without testing.
But things are still fluid. There are definite signs that these compromises are due to be addressed on the roadmap, especially judging by the list of known issues: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues.
One of the major differences between Data Lake Storage and Blob storage is the hierarchical namespace. It is a very important added feature in Data Lake Storage Gen2. If you remember, while converting our storage account to Data Lake, we enabled the hierarchical namespace setting, and that is how the storage account became a Data Lake Storage Gen2 account.
Hierarchical storage simply means that the collection of objects and files is organized into a tree of folders and nested folders, in the same way that the file system on our computer or laptop is organized. So basically, the hierarchical namespace organizes the objects or files into a hierarchy of directories for efficient data access.
Now, if you have some experience with Blob storage, you might be wondering why it is not considered hierarchical. After all, blobs are often organized in a structure that seems to include folders and subfolders. However, that is simply a naming convention: you can put slashes in your blob names to simulate a tree-like structure, but the blobs are really just files in a flat structure. With a hierarchical namespace, the folders actually exist as objects in their own right.
This simple-looking change makes a huge difference in big data analytics. Hadoop tools expect real file system semantics such as directories and renames. Blob storage can only emulate these (for example through the WASB driver), whereas Data Lake Storage Gen2 supports a hierarchical namespace natively. This is what lets Data Lake Storage Gen2 integrate seamlessly with the huge ecosystem of Hadoop software.
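To make the "folders are only a naming convention" point concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Azure SDK: the dictionary `flat_container` and the helpers `list_virtual_folder` and `virtual_subfolders` are hypothetical names made up for illustration. It shows that in a flat namespace a "folder" can only be found by scanning blob names for a prefix.

```python
# Toy model of a flat blob container: every blob is just a name mapped to bytes.
# The slashes in the names carry no structure of their own.
flat_container = {
    "logs/2023/jan.csv": b"a",
    "logs/2023/feb.csv": b"b",
    "logs/2024/jan.csv": b"c",
    "readme.txt": b"d",
}


def list_virtual_folder(container, prefix):
    """List a simulated folder by checking every blob name against a prefix."""
    return sorted(name for name in container if name.startswith(prefix))


def virtual_subfolders(container, prefix):
    """Derive the immediate 'subfolders' of a prefix purely from blob names."""
    subfolders = set()
    for name in container:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            if "/" in rest:
                subfolders.add(rest.split("/", 1)[0])
    return sorted(subfolders)


print(list_virtual_folder(flat_container, "logs/2023/"))
# ['logs/2023/feb.csv', 'logs/2023/jan.csv']
print(virtual_subfolders(flat_container, "logs/"))
# ['2023', '2024']
```

Note that no "logs" or "2023" object exists anywhere in the model. The folders appear only because the names happen to share a prefix.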
Now, as I said, in Blob storage we were using slashes to simulate a tree-like directory structure. That helps organize objects to a certain extent. But when it comes to actions like moving, renaming, or deleting directories, the slash-based structure brings no help, because without real directories applications have to process potentially millions of individual blobs to achieve a directory-level task. By contrast, a hierarchical namespace processes these tasks by updating a single entry. So Gen2 is really manageable: delete, rename, and move are easy, and you can organize and manipulate files through directories and subdirectories.
For Blob storage to operate on a simulated folder, it has to perform a separate operation on each file. Data Lake Storage Gen2, on the other hand, is designed to perform operations on a folder, so it can do so very quickly. Let me put some context around this. Imagine you have a folder with 5000 files in traditional object storage, and let's say you need to rename this folder. On an object store like Blob, that would mean 5000 file copies and then 5000 file deletes, all because you have to perform these operations from the front end.
But with Data Lake Storage Gen2, these operations take place in the back end. So for you, it is just a single call, a single action. The hierarchical namespace feature has also significantly improved the overall performance of many analytics jobs. This improvement in performance means that you require less computing power to process the same amount of data, which means a lower total cost of ownership for end-to-end analytics jobs. On top of that, file system semantics are well understood by developers and users.
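The 5000-file rename described above can be sketched as a small simulation. This is a toy model in plain Python under stated assumptions, not how Azure implements either service: `rename_flat` and `rename_hierarchical` are hypothetical helpers, the flat store is a dictionary of full blob names, and the hierarchical store is a nested dictionary in which a directory is a single entry in its parent.

```python
def rename_flat(container, old_prefix, new_prefix):
    """Rename a simulated folder in a flat namespace.

    Every blob under the prefix is copied to its new name and the old
    blob is deleted. Returns the number of operations performed.
    """
    operations = 0
    # Snapshot the matching names first so the dict is not mutated mid-iteration.
    for name in [n for n in container if n.startswith(old_prefix)]:
        container[new_prefix + name[len(old_prefix):]] = container[name]  # copy
        operations += 1
        del container[name]  # delete
        operations += 1
    return operations


def rename_hierarchical(root, old_name, new_name):
    """Rename a real directory by re-pointing one entry in its parent."""
    root[new_name] = root.pop(old_name)
    return 1


# Flat store: 5000 blobs whose names merely share the "raw/" prefix.
flat = {f"raw/file{i}.csv": b"" for i in range(5000)}
# Hierarchical store: one directory entry holding 5000 files.
tree = {"raw": {f"file{i}.csv": b"" for i in range(5000)}}

flat_ops = rename_flat(flat, "raw/", "archive/")
tree_ops = rename_hierarchical(tree, "raw", "archive")
print(flat_ops, tree_ops)  # 10000 1
```

The flat rename costs 5000 copies plus 5000 deletes, while the hierarchical rename touches a single directory entry no matter how many files the directory contains.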
Now you may ask why this was not done before. One of the reasons that object stores have not historically supported a hierarchical namespace is that a hierarchical namespace limits scalability. However, the Data Lake Storage Gen2 hierarchical namespace scales linearly and does not degrade either data capacity or performance. There are also scenarios where you actually don't want to use the hierarchical namespace, because some workloads gain no benefit from enabling it. Examples are backups, image storage, and other applications where the object organization is stored separately from the objects themselves, for instance in a separate database. So basically it all depends on your requirements.
Also, after you have enabled the hierarchical namespace on your account, you cannot revert to the flat namespace, so keep this in mind. I think by now I have made clear what the hierarchical namespace is and how it makes Data Lake Storage Gen2 very special among all the storage services.