Is it better to have many small Azure storage blob containers (each with some blobs) or one really large container with tons of blobs?

Tags: azure, azure-storage, azure-blob-storage

So the scenario is the following:

I have multiple instances of a web service, each of which writes blobs of data to Azure Storage. I need to be able to group the blobs into a container (or a virtual directory) depending on when each one was received. Periodically (at least once a day), older blobs will get processed and then deleted.

I have two options:

Option 1

I make one container called "blobs" (for example) and store all the blobs in that container. Each blob uses a directory-style name, with the directory name being the time it was received (e.g. "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ... , "hr23min0/dataN.bin", etc.; a new directory every X minutes). The thing that processes these blobs will process the hr0min0 blobs first, then hr0minX, and so on (and blobs are still being written while others are being processed).
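To make the naming scheme in Option 1 concrete, here is a minimal sketch of how a writer could derive the directory-style blob name from the arrival time. The function names `bucket_prefix` and `blob_name` are hypothetical, and the 15-minute default is only an assumed value for "X"; nothing here calls Azure.

```python
from datetime import datetime


def bucket_prefix(received: datetime, bucket_minutes: int = 15) -> str:
    """Return a virtual-directory prefix such as 'hr0min30/' for an arrival time.

    bucket_minutes is the assumed "X" from the question (hypothetical default).
    """
    if not 1 <= bucket_minutes <= 60 or 60 % bucket_minutes != 0:
        raise ValueError("bucket_minutes must divide 60 evenly")
    # Round the minute down to the start of the bucket it falls in.
    minute = (received.minute // bucket_minutes) * bucket_minutes
    return f"hr{received.hour}min{minute}/"


def blob_name(received: datetime, file_name: str, bucket_minutes: int = 15) -> str:
    """Join the time-bucket prefix and the file name into one blob name."""
    return bucket_prefix(received, bucket_minutes) + file_name


example = blob_name(datetime(2011, 11, 16, 0, 37), "data3.bin")
```

One thing to keep in mind with this format: the numbers are not zero-padded, so names sort lexicographically rather than chronologically ("hr10min0/" sorts before "hr2min0/"). If the processor relies on listing order, a padded format may be worth considering.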

Option 2

I have many containers, each with a name based on the arrival time (so the first will be a container called blobs_hr0min0, then blobs_hr0minX, etc.), and the blobs in each container are those that arrived at the named time. The thing that processes these blobs will process one container at a time.

So my question is: which option is better? Does option 2 give me better parallelization (since containers can live on different servers), or is option 1 better because having many containers can cause other, unknown issues?

asked Nov 16 '11 20:11

encee

2 Answers

I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Windows Azure Blob storage is done at the blob level, not the container level. Reasons to spread blobs across different containers have more to do with access control (e.g. SAS) or total storage size.

See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx

(Scroll down to "Partitions").

Quoting:

Blobs – Since the partition key is down to the blob name, we can load balance access to different blobs across as many servers in order to scale out access to them. This allows the containers to grow as large as you need them to (within the storage account space limit). The tradeoff is that we don’t provide the ability to do atomic transactions across multiple blobs.

answered Sep 28 '22 01:09

Eugenio Pace

Everyone has given you excellent answers around accessing blobs directly. However, if you need to list the blobs in a container, you will likely see better performance with the many-container model. I just talked with a company that has been storing a massive number of blobs in a single container. They frequently list the objects in the container and then perform actions against a subset of those blobs. They're seeing a performance hit, because the time to retrieve a full listing keeps growing.

This might not apply to your scenario, but it's something to consider...
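If you stay with a single container, one way to avoid retrieving a full listing may be to list by name prefix, so that only one time bucket comes back. The sketch below assumes the `azure-storage-blob` Python package, whose `ContainerClient.list_blobs` accepts a `name_starts_with` argument; `names_under_prefix` and `FakeContainer` are hypothetical names, and the fake is only an in-memory stand-in so the example runs without an Azure account. Whether this removes the slowdown described above is something you would need to measure.

```python
from types import SimpleNamespace


def names_under_prefix(container_client, prefix):
    """Collect the names of blobs whose names start with `prefix`.

    `container_client` is expected to behave like
    azure.storage.blob.ContainerClient (an assumption of this sketch).
    """
    return [blob.name
            for blob in container_client.list_blobs(name_starts_with=prefix)]


class FakeContainer:
    """Hypothetical in-memory stand-in for a container client."""

    def __init__(self, names):
        self._names = sorted(names)

    def list_blobs(self, name_starts_with=None):
        prefix = name_starts_with or ""
        # Yield objects with a .name attribute, like real blob listings do.
        return (SimpleNamespace(name=n)
                for n in self._names if n.startswith(prefix))


fake = FakeContainer([
    "hr0min0/data.bin",
    "hr0min0/data2.bin",
    "hr0min30/data3.bin",
    "hr1min45/data.bin",
])
# The trailing slash keeps "hr0min30/..." out of the "hr0min0/" bucket.
first_bucket = names_under_prefix(fake, "hr0min0/")
```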

answered Sep 28 '22 01:09

David Makogon
