What is hierarchical namespace in Microsoft Azure Data Lake storage (Gen2)?

I read Microsoft's documentation on it (https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace), but I was unable to understand it clearly.

Can anyone please help me understand it in layman's terms / simple language?

How does this feature distinguish ADLS from Azure Blob storage?

nomadSK25 asked Sep 18 '19 10:09


2 Answers

The summary, for now, is that Hierarchical Namespace makes Azure Storage behave more like an ADLS Gen1-style store in practice, at the cost of losing some Azure Blob Storage functionality.

Hierarchical Namespace gains you:

  • A folder structure that behaves more like a traditional OS file system in terms of moves and renames
  • Fine-grained, AAD-based access control at the directory/sub-directory level

At the same time, you lose Blob Storage features including:

  • Blob Soft Delete (delete/recover blobs)
  • Custom Domain, Azure CDN, Azure Search integration in Azure Portal UI
  • Blob Lifecycle Management in UI (Archiving/Deleting/Warming up blobs by the filter on schedule)
  • Full Blob Storage API support (only a limited subset is available)

In practice, you can expect inconsistent compatibility problems with anything that tries to interact with Azure Storage. It might work 100%, it might refuse to work at all (or not list the Storage Account as an option, if using Azure Portal UI wizards), or it might work partially. Without knowing the underlying implementation, it's difficult to predict the outcome without testing.

But things are still fluid. There are definite signs that these compromises are due to be addressed in the roadmap, especially based on the list of known issues: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues.

Andrew Sexton answered Sep 25 '22 02:09


One of the major differences between Data Lake Storage and Blob storage is the hierarchical namespace. A hierarchical namespace is a very important feature added in Data Lake Storage Gen2. When you convert a storage account to Data Lake, you enable the hierarchical namespace setting, and that is what turns the storage account into a Data Lake Storage Gen2 account.

Hierarchical storage simply means that the collection of objects and files is organized into a tree of folders and nested folders, in the same way that the file system on your computer or laptop is organized. So a hierarchical namespace organizes the objects or files into a hierarchy of directories for efficient data access.

If you have some experience with Blob storage, you might be wondering why it is not considered hierarchical. After all, blobs are often organized in a structure that seems to include folders and subfolders. However, that is simply a naming convention: you can put slashes in your blob names to simulate a tree-like hierarchical structure, but the blobs are really just files in a flat structure. With a hierarchical namespace, folders become real objects.

This simple-looking change makes a huge difference in big data analytics. Hadoop expects a hierarchical file system in order to integrate with storage, and flat Blob storage does not natively provide one. Data Lake Gen2 does support a hierarchical namespace, which lets it integrate seamlessly with the huge ecosystem of Hadoop software.
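The idea that blob "folders" are only a naming convention can be sketched with a toy in-memory model. This is not the Azure SDK; the store contents and the `list_virtual_dir` helper are hypothetical names made up for illustration. It shows that a "directory listing" in a flat namespace is really a scan over full names with a prefix and a delimiter.

```python
# Toy model of a flat namespace (an assumption for illustration, not Azure's
# implementation): every blob is just a full name mapped to its content.
flat_store = {
    "logs/2019/09/app.log": b"...",
    "logs/2019/10/app.log": b"...",
    "images/cat.png": b"...",
}


def list_virtual_dir(store, prefix, delimiter="/"):
    """Emulate a directory listing by scanning every blob name for a prefix."""
    subdirs, files = set(), set()
    for name in store:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter, so this looks like a subfolder.
            subdirs.add(prefix + head + delimiter)
        else:
            files.add(name)
    return sorted(subdirs), sorted(files)


root_dirs, root_files = list_virtual_dir(flat_store, "")
log_dirs, log_files = list_virtual_dir(flat_store, "logs/")
image_dirs, image_files = list_virtual_dir(flat_store, "images/")
```

Note that no "logs/" object exists anywhere in the store; the folder appears only because some blob names happen to start with that prefix.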

As I said, in Blob storage we use slashes to simulate a tree-like directory structure. That helps to organize objects to a certain extent. But when it comes to actions like moving, renaming, or deleting directories, the slash-based structure is no help: without real directories, applications have to process potentially millions of individual blobs to achieve a directory-level task. By contrast, a hierarchical namespace processes these tasks by updating a single entry. So Gen2 is really manageable. Deleting, renaming, and moving are easy, and you can organize and manipulate files through directories and subdirectories.

For Blob storage to operate on a simulated folder, it has to perform a separate operation on each file. Data Lake Gen2 is designed to perform operations on a folder, so it can do so very quickly. To put some context around this, imagine you have a folder with 5000 files in traditional object storage and you need to rename this folder. On an object store like Blob, that means 5000 file copies followed by 5000 file deletes, all because you have to perform these operations from the front end.
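The 5000-file rename can be sketched with the same kind of toy model. Again, this is an in-memory illustration under assumed data structures (a dict of full names for the flat store, nested dicts for the hierarchical one), not how Azure implements either service; `rename_flat` and `rename_hierarchical` are hypothetical helpers that simply count the operations each approach needs.

```python
def rename_flat(store, old_prefix, new_prefix):
    """Flat namespace: rename a 'folder' by copying, then deleting, each blob."""
    ops = 0
    # Snapshot the matching names first so the dict is not mutated mid-iteration.
    for name in [n for n in store if n.startswith(old_prefix)]:
        store[new_prefix + name[len(old_prefix):]] = store[name]  # one copy
        ops += 1
        del store[name]  # one delete
        ops += 1
    return ops


def rename_hierarchical(tree, parent_path, old_name, new_name):
    """Hierarchical namespace: re-point a single directory entry in its parent."""
    parent = tree
    for part in parent_path:
        parent = parent[part]
    parent[new_name] = parent.pop(old_name)  # one metadata update
    return 1


# Flat store: 5000 blobs whose names share the "raw/" prefix.
flat = {f"raw/file{i}.csv": b"" for i in range(5000)}
flat_ops = rename_flat(flat, "raw/", "processed/")

# Hierarchical store: one real directory holding 5000 files.
tree = {"raw": {f"file{i}.csv": b"" for i in range(5000)}}
tree_ops = rename_hierarchical(tree, [], "raw", "processed")
```

In this model the flat rename costs two operations per file, while the hierarchical rename costs one operation regardless of how many files the directory holds.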

But with Data Lake Gen2, these operations take place in the back end, so for you it is just a single call or a single action. The hierarchical namespace feature has also significantly improved the overall performance of many analytics jobs. This improvement in performance means that you need less computing power to process the same amount of data, which means a lower total cost of ownership for end-to-end analytics jobs. In addition, file systems are well understood by developers and users.

You may ask why this was not done before. One of the reasons that object stores have not historically supported a hierarchical namespace is that a hierarchical namespace limits scalability. However, the Data Lake Storage Gen2 hierarchical namespace scales linearly and does not degrade either data capacity or performance. There are also scenarios where you don't want to use the hierarchical namespace, because some workloads do not gain any benefit from enabling it. Examples include backups, image storage, and other applications where the object organization is stored separately from the objects themselves, for instance in a separate database. So it all depends on your requirements.

After you have enabled the hierarchical namespace on your account, you cannot revert to the flat namespace, so keep this in mind. I hope this makes clear what the hierarchical namespace is and how it sets Data Lake Storage Gen2 apart from the other storage services.

Ayush Dixit answered Sep 24 '22 02:09