
How to partition Azure tables used for storing logs

We have recently updated our logging to use Azure table storage, which, owing to its low cost and its high performance when querying by partition and row key, is well suited to this purpose.

We are trying to follow the guidelines given in the document Designing a Scalable Partitioning Strategy for Azure Table Storage. As we are making a great number of inserts to this table (and hopefully an increasing number as we scale), we need to ensure that we don't hit our limits and end up losing logs. We structured our design as follows:

  • We have an Azure storage account per environment (DEV, TEST, PROD).

  • We have a table per product.

  • We are using TicksReversed+GUID for the Row Key, so that we can query blocks of results between certain times with high performance (a sketch of this key scheme follows the list).

  • We originally chose to partition the table by Logger, which for us meant broad areas of the product such as API, Application, Performance and Caching. However, because of the low number of partitions we were concerned that this resulted in so-called "hot" partitions, where many inserts hit one partition in a given time period. So we changed to partitioning on Context (for us, the class name or API resource).
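
For concreteness, here is a minimal Python sketch of the key scheme described above (the original system is presumably .NET, given the use of ticks; the helper names and the example context name are illustrative only):

    import uuid
    from datetime import datetime, timezone

    # .NET-style ticks: 100-nanosecond intervals since 0001-01-01 (UTC).
    _EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)
    _MAX_TICKS = 3155378975999999999  # DateTime.MaxValue.Ticks

    def _ticks(dt: datetime) -> int:
        td = dt - _EPOCH
        return (td.days * 86_400 + td.seconds) * 10_000_000 + td.microseconds * 10

    def row_key(dt: datetime) -> str:
        # Reversed ticks make newer rows sort first lexicographically;
        # the GUID suffix guarantees uniqueness within the same tick.
        return f"{_MAX_TICKS - _ticks(dt):019d}_{uuid.uuid4()}"

    def partition_key(context: str) -> str:
        # Partition on Context (class name or API resource), as described above.
        return context

    entity = {
        "PartitionKey": partition_key("OrdersController"),  # illustrative context name
        "RowKey": row_key(datetime.now(timezone.utc)),
        "Message": "example log message",
    }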

However, in practice we have found this less than ideal, because when we look at our logs at a glance we would like them to appear in time order. Instead we end up with blocks of results grouped by context, and we would have to query every partition if we want to order the results by time.

Some ideas we had were:

  • use blocks of time (say 1 hour) as partition keys so results are ordered by time (this creates a hot partition for the duration of each hour)

  • use a few random GUIDs as partition keys to try to distribute the logs (we lose the ability to query quickly on attributes such as Context).

As this is such a common application of Azure table storage, there must be some sort of standard procedure. What is the best practice for partitioning Azure tables that are used for storing logs?

Solution constraints

  • Use cheap Azure storage (Table Storage seems the obvious choice)

  • Fast, scalable writes

  • Low chance of lost logs (i.e. of exceeding Azure table storage's write rate of 2,000 entities per second per partition).

  • Reading ordered by date, most recent first.

  • If possible, to partition on something that would be useful to query (such as product area).

asked Apr 24 '15 by 08Dc91wk


2 Answers

I have come across a situation similar to yours; based on my experience I can say the following:

Whenever a query is fired against an Azure storage table, it performs a full table scan if a proper partition key is not provided. In other words, the table is indexed on the partition key, and partitioning the data properly is the key to getting fast results.

With that in mind, think about the kinds of queries you will fire against the table, such as logs that occurred during a certain time period, or logs for a particular product.

One way is to use reversed ticks truncated to hour precision as the partition key, instead of the exact ticks. That way an hour's worth of data can be queried through a single partition key. Depending on how many rows fall into each partition, you could change the precision to a day. It is also wise to store related data together, meaning the data for each product goes into its own table. That way you reduce the number of partitions and the number of rows in each partition.

Basically, make sure you know the partition keys in advance (exactly or as a range) and fire queries against those specific partition keys to get results faster.
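
As an illustration of this approach, here is a hedged Python sketch (assuming the azure-data-tables SDK; the connection string, table name and helper names are placeholders of mine, not part of the answer):

    from datetime import datetime, timezone
    from azure.data.tables import TableClient

    _EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)
    _MAX_TICKS = 3155378975999999999  # .NET DateTime.MaxValue.Ticks

    def hour_partition_key(dt: datetime) -> str:
        # Truncate to the hour, then reverse the ticks so newer hours sort first.
        hour = dt.replace(minute=0, second=0, microsecond=0)
        td = hour - _EPOCH
        ticks = (td.days * 86_400 + td.seconds) * 10_000_000
        return f"{_MAX_TICKS - ticks:019d}"

    # One table per product, as suggested above; names are placeholders.
    table = TableClient.from_connection_string(conn_str="...", table_name="ProductXLogs")

    # Query exactly one known partition: a single hour's worth of logs.
    pk = hour_partition_key(datetime(2015, 4, 24, 8, 0, tzinfo=timezone.utc))
    for entity in table.query_entities(
        query_filter="PartitionKey eq @pk",
        parameters={"pk": pk},
    ):
        print(entity["RowKey"], entity.get("Message"))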

To speed up writes to the table, you can use batch operations (entity group transactions). Be cautious, though: if one entity in the batch fails, the whole batch operation fails. Proper retry and error handling can save you here.
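
A rough sketch of a batched insert with the azure-data-tables SDK (my choice of SDK, not the answer's; note that an entity group transaction requires all entities to share the same PartitionKey and holds at most 100 entities):

    import uuid
    from azure.core.exceptions import HttpResponseError
    from azure.data.tables import TableClient

    # Placeholder connection string and table name.
    table = TableClient.from_connection_string(conn_str="...", table_name="ProductXLogs")

    def write_batch(partition_key: str, messages: list[str]) -> None:
        # All entities in one transaction must share the same PartitionKey,
        # and a single transaction holds at most 100 entities.
        operations = [
            ("create", {
                "PartitionKey": partition_key,
                "RowKey": str(uuid.uuid4()),  # use a reversed-ticks RowKey in practice
                "Message": message,
            })
            for message in messages[:100]
        ]
        try:
            table.submit_transaction(operations)
        except HttpResponseError:
            # If any entity fails, the whole transaction fails; retry or
            # fall back to individual inserts here.
            raise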

At the same time, you could use blob storage to store a lot of related data. The idea is to store a chunk of related, serialized data as one blob. You can hit one such blob to get all the data in it and do further projections on the client side. For example, an hour's worth of data for a product could go into one blob; you can devise a specific blob-naming prefix pattern and hit the exact blob when needed. This will help you get your data pretty fast rather than doing a table scan for each query.

I used the blob approach and have been using it for a couple of years with no trouble. I convert my collection to IList<IDictionary<string,string>> and use binary serialization and gzip when storing each blob. I use Reflection.Emit-based helper methods to access entity properties quickly, so serialization and deserialization don't take a toll on CPU and memory.
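
The answer's stack is .NET (binary serialization plus Reflection.Emit); purely as an illustration of the same pattern, here is a sketch using gzip-compressed JSON and the azure-storage-blob SDK, which are substitutions of mine:

    import gzip
    import json
    from datetime import datetime, timezone
    from azure.storage.blob import BlobClient

    def upload_log_chunk(conn_str: str, product: str, records: list[dict]) -> None:
        # One compressed blob per product per hour; the prefix pattern
        # (product/yyyy/mm/dd/hh) lets a reader hit the exact blob later.
        hour = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H")
        blob = BlobClient.from_connection_string(
            conn_str=conn_str,
            container_name="logs",  # placeholder container name
            blob_name=f"{product}/{hour}.json.gz",
        )
        payload = gzip.compress(json.dumps(records).encode("utf-8"))
        blob.upload_blob(payload, overwrite=True)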

Storing data in blobs helps me store more for less and get my data faster.

answered Sep 19 '22 by Amit


There is a very general trick for avoiding hot spots when writing, at the cost of slightly more expensive reads.

Define N partitions (say 10 or so). When writing a row, put it into a random partition. Within each partition the rows can still be sorted by time.

When reading, you need to read from all N partitions (possibly filtered and ordered by time) and merge the query results.

This increases write scalability by a factor of N, and increases query cost by the same factor in round trips and queries.
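
A hedged Python sketch of this trick using the azure-data-tables SDK (N, the partition naming scheme and the helper names are illustrative; the merge relies on a time-ordered RowKey such as the reversed ticks from the question):

    import heapq
    import random
    from azure.data.tables import TableClient

    N = 10  # number of partitions; see the sizing advice below
    table = TableClient.from_connection_string(conn_str="...", table_name="ProductXLogs")

    def write_log(entity: dict) -> None:
        # Spread writes across N partitions to avoid a single hot partition.
        # The entity is expected to already carry a reversed-ticks RowKey.
        entity["PartitionKey"] = f"p{random.randrange(N):02d}"
        table.create_entity(entity)

    def read_logs(extra_filter: str = ""):
        # Fan out to all N partitions and merge the streams on RowKey.
        # Each partition comes back sorted by RowKey, and reversed ticks
        # mean the merged stream is ordered newest-first.
        streams = []
        for i in range(N):
            q = f"PartitionKey eq 'p{i:02d}'"
            if extra_filter:
                q += f" and ({extra_filter})"
            streams.append(table.query_entities(query_filter=q))
        return heapq.merge(*streams, key=lambda e: e["RowKey"])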

Also, you could consider storing logs somewhere else. The very tight artificial limits on Azure products cause labor costs that you otherwise would not have.

Choose N to be higher than needed to reach the 20,000 operations-per-second account limit, so that randomly occurring hot spots are unlikely. Picking N twice as high as the minimum needed should be enough.

answered Sep 22 '22 by usr