 

Azure Data Table - Correct usage of RowKey as DateTime.Ticks?

I'm working on an Azure project that involves Azure IoT Hub and Azure Functions.

I have about 50 sensors, each sending one new message to IoT Hub every 10 seconds.

Each time IoT Hub receives a new message, I want to execute a function that reads the message and saves it into Azure Table storage.

At the moment, I'm a little bit lost as to what kind of Azure Table storage design I should be using. This is my proposed Table storage design so far:

[PartitionKey][RowKey][TimeStamp][SensorSerial][Reading][Type]

And this is a mock-up of how the data would look in Azure Storage Explorer:

 [GroupA][?][2017-05-03T12:20:22.713Z][xxx][60][Temperature]
 [GroupA][?][2017-05-03T12:25:22.713Z][xxx][61][Temperature]
 [GroupA][?][2017-05-03T12:30:22.713Z][xxx][59][Temperature]
 [GroupB][?][2017-05-03T12:35:22.713Z][yyy][90][Humidity]
 [GroupB][?][2017-05-03T12:40:22.713Z][yyy][92][Humidity]

I've left RowKey as "?" for the moment because it's related to the question at hand.

The problem is that I want to be able to query the Table storage data based on both SensorSerial and a specified time frame - e.g. get all xxx readings from the last 15 seconds.

The following query always returns no data:

TableQuery<Readings> rangeQuery = new TableQuery<Readings>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("SensorSerial", QueryComparisons.Equal, "xxx"),
        TableOperators.And,
        TableQuery.GenerateFilterConditionForDate("TimeStamp",
            QueryComparisons.GreaterThanOrEqual, DateTime.Now.AddSeconds(-15))));

From what I've read so far - and I'm not sure why this is the case - one cannot filter data on the TimeStamp field. Because of this, you must use RowKey as a sort of pseudo-timestamp DateTime tick field.

So in order to fix this, I plan on using this as my RowKey value:

var RowKey = string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks);

Which would satisfy this query and return the necessary values:

TableQuery<Readings> query = new TableQuery<Readings>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("SensorSerial", QueryComparisons.Equal, "xxx"),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual,
            "2519084875883616261")));

However, and I might be wrong here, this approach could potentially cause some issues because of the following:

What if two or more sensors begin to transmit data at the same time/interval? RowKey must be unique; the moment one sensor inserts a new row into Azure Storage, the other will no longer be able to.

I could run the code hoping that transmission/data processing/insertion introduces enough delay to never cause a collision, but relying on that would be bad.

Is there a better way? A more fail-safe approach that allows me to query Azure Table storage based on a specified time frame and a unique device identifier?


1 Answer

Let's first talk about your current approach.

The approach you're taking is quite OK for now. The plus side of your approach is that you're using reverse ticks (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks), which ensures that the latest data gets added at the top of the table instead of at the bottom, so as long as you're querying the last x minutes/hours of data, retrieval will be very fast.

Down the road I see some issues with this approach:

  • As the data grows, and when you wish to query really old data, you will run into situations where partition scans happen. This is somewhat better than a full table scan, but it should still be avoided if possible.
  • You're putting everything in one table, so you will eventually hit the scalability limits imposed by the Table service, since all reads/writes happen against just one table. This will have an adverse impact on performance.

Possible Solution

One possible solution (considering for now that your queries are targeted at a single sensor) is to have a separate table for each sensor and store the data for that sensor in its designated table. The advantages I see with this approach are:

  • Since each sensor gets its own table, you have essentially freed up one key. In this scenario, you could use PartitionKey as reverse ticks and set RowKey to any other value you like. I would recommend storing ticks with a coarser granularity (say, an hour) for PartitionKey and keeping RowKey as it is now; see the sketch after this list. This would ensure that you don't end up creating a lot of partitions.
  • Since each sensor's data is stored in a separate table, you could potentially put them in different storage accounts. So the SensorA table could be in Storage Account A and the SensorB table could be in Storage Account B. That way you're essentially load balancing the traffic between different tables/storage accounts and would achieve better scalability and throughput.
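
A rough sketch of what that per-sensor keying could look like (the WindowsAzure.Storage calls are standard, but the table naming and the hourly bucket are illustrative choices, not a requirement; sensorSerial and tableClient are assumed to be in scope):

// Sketch: one table per sensor, PartitionKey = reverse ticks of the hour bucket,
// RowKey = reverse ticks of the exact reading time (same formula as in the question).
string tableName = "Sensor" + sensorSerial;                  // e.g. "Sensorxxx" - illustrative naming
CloudTable table = tableClient.GetTableReference(tableName);
table.CreateIfNotExists();

DateTime utcNow = DateTime.UtcNow;
DateTime hourBucket = new DateTime(utcNow.Year, utcNow.Month, utcNow.Day, utcNow.Hour, 0, 0, DateTimeKind.Utc);

string partitionKey = string.Format("{0:D19}", DateTime.MaxValue.Ticks - hourBucket.Ticks);
string rowKey = string.Format("{0:D19}", DateTime.MaxValue.Ticks - utcNow.Ticks);

var entity = new Readings { PartitionKey = partitionKey, RowKey = rowKey, SensorSerial = sensorSerial };
table.Execute(TableOperation.Insert(entity));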

Obviously the downside of this approach is that it increases the management headache for you. You would need to have some kind of master database where you keep the association between sensors and their storage accounts. Another downside of this approach is that you will not be able to query on just the timestamp (my 2nd question). For that, you could keep just one table in another storage account, using the approach you're currently taking.
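
For example, that association could be as simple as a small lookup you maintain yourself (purely illustrative - LoadSensorAccountMap is a hypothetical helper backed by whatever master store you choose):

// Purely illustrative: resolve which storage account holds a given sensor's table.
// LoadSensorAccountMap() is a hypothetical helper reading from your master database.
IDictionary<string, string> sensorToConnectionString = LoadSensorAccountMap();
CloudStorageAccount account = CloudStorageAccount.Parse(sensorToConnectionString[sensorSerial]);
CloudTableClient tableClient = account.CreateCloudTableClient();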

Regarding your comment "What if two or more sensors begin to transmit data at the same time/interval? RowKey must be unique; the moment one sensor inserts a new row into Azure Storage, the other will no longer be able to." - essentially, RowKey only has to be unique within a partition, or in other words the PartitionKey + RowKey combination must be unique in a table. So I don't think it's going to be an issue.
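
That said, if you ever want to be extra safe against two rows landing in the same partition with identical tick values, a common pattern (just a suggestion, not something your current design requires) is to append a discriminating suffix to the reverse-tick value:

// Reverse ticks plus a suffix keeps RowKey unique within a partition even when
// two writes share the exact same tick value; lexical ordering by ticks is preserved.
string reverseTicks = string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks);
string rowKey = reverseTicks + "_" + sensorSerial;   // or + "_" + Guid.NewGuid().ToString("N")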
