Situation: I've started a new job and been assigned the task of figuring out what to do with their sensor data table. It has 1.3 billion rows of sensor data. The data is pretty simple: basically just a sensor ID, a date and the sensor value at that point in time (double).
Currently, the data is stored in a table in an MS SQL Server database.
By the end of this year, I expect the number of rows to have increased to 2-3 billion.
I'm looking for a better way to store and query this data (by date), and since there are a lot of "big data" products out there, and I have no real experience managing such big data sets, I'm asking here for any pointers.
It's not a large company, and our resources are not unlimited ;)
Some more details about our use case:
My research so far has led me to consider the following solutions:
Keep the data in SQL Server
but partition the table (it's not partitioned right now). That will require the Enterprise edition of SQL Server, which costs a lot.
Move the data to Azure SQL Server.
There we'll get the partitioning feature for a lot less money, but once our DB grows above 250 GB it costs a lot more (and way too much above 500 GB).
Use several Databases
We could use one DB per customer. Several smaller DBs will be cheaper than one huge DB, but we've got a lot of customers and plans for more, so I don't really like the thought of managing all these databases.
Azure Storage Tables
This is the option I like best so far. We can partition the data by company/sensor/year/month, use the date as the row key and store the sensor value.
I haven't yet had time to test query performance, but from what I've read it should be good. There is one major downside, though: the limit of 1,000 items returned per HTTP request. If we need to fetch all sensor data for a week, we have to make a lot of HTTP requests. I'm not sure right now how big an issue this is for our use case.
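To make the idea concrete, here is a minimal sketch of what this layout could look like with the current azure-data-tables Python package. The table name, partition-key format and connection string are my own assumptions, not anything decided yet; the SDK transparently follows the 1,000-item continuation tokens while you iterate.

```python
# Hypothetical sketch: partition key = company_sensor_year_month,
# row key = ISO timestamp, so a week of one sensor's data is a range
# scan inside one (or two) partitions.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("sensordata")  # table name is an assumption

# Write one reading.
table.upsert_entity({
    "PartitionKey": "acme_sensor42_2016_03",
    "RowKey": "2016-03-07T13:45:00Z",
    "Value": 21.7,
})

# Read a week of data for that sensor; paging past the 1,000-item
# limit is handled by the returned iterator.
flt = ("PartitionKey eq 'acme_sensor42_2016_03' "
       "and RowKey ge '2016-03-07' and RowKey lt '2016-03-14'")
for entity in table.query_entities(flt):
    print(entity["RowKey"], entity["Value"])
```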
Azure HDInsight (Hadoop in Azure)
As mentioned, I have no experience with big data, and currently I don't understand Hadoop well enough to know whether it fits our case (exposing sensor data, for a given timespan, through an API). Should I dig deeper and learn, or is my time better spent pursuing another alternative?
Does anyone have experience with a similar case? What works for you? Keep in mind that price matters, and a "simple" solution may be preferred over a very complex one, even if the complex one is a few seconds faster.
UPDATE 1: To answer some of the questions in the comments below.
UPDATE 2: Today I experimented a bit with both Azure Table Storage and HDInsight (HDI). We don't require much query "flexibility", so I think Azure Table Storage looks really promising. It's a bit slow to pull out data due to the 1,000-item limit per request, as I mentioned, but in my tests it was fast enough for our use cases.
I've also stumbled upon OpenTSDB, which is what led me to try out HDI in the first place. Following a tutorial on Azure (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-tutorial-get-started/), I was able to store a million records and test some queries fairly quickly. It was a lot faster to query than Azure Table Storage; I could even pull down 300,000 records in one HTTP request (though it took 30 seconds).
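For reference, a timespan query against OpenTSDB's HTTP API looks roughly like the sketch below; the metric name "sensor.value" and the "sensor_id" tag are assumptions I made for illustration, not what the tutorial uses.

```python
# Hypothetical OpenTSDB query over HTTP for one sensor, one week.
import requests

query = {
    "start": "2016/03/07-00:00:00",
    "end":   "2016/03/14-00:00:00",
    "queries": [{
        "aggregator": "avg",          # only one tagged series, so effectively a no-op
        "metric": "sensor.value",
        "tags": {"sensor_id": "42"},
    }],
}
resp = requests.post("http://<opentsdb-host>:4242/api/query", json=query)
for series in resp.json():
    for ts, value in sorted(series["dps"].items()):
        print(ts, value)
```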
But it costs quite a bit more than Azure Table Storage, and I think I can optimize my code to improve query performance with Azure Table Storage (a more fine-grained partition key and running requests in parallel). So right now I'm leaning towards Azure Table Storage because of its simplicity, price and "good enough" performance.
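The "fine-grained partition key + parallel requests" idea would look something like this sketch, again assuming the azure-data-tables package and per-day partition keys (both assumptions on my part): split the week into one query per partition and run them concurrently.

```python
# Hypothetical: query several per-day partitions in parallel and merge results.
from concurrent.futures import ThreadPoolExecutor
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.get_table_client("sensordata")  # table name is an assumption

def fetch_partition(pk):
    # Each per-day partition holds one sensor's readings for one day.
    return list(table.query_entities(f"PartitionKey eq '{pk}'"))

days = [f"acme_sensor42_2016_03_{d:02d}" for d in range(7, 14)]  # assumed key format

with ThreadPoolExecutor(max_workers=8) as pool:
    results = pool.map(fetch_partition, days)
    readings = [entity for day in results for entity in day]

print(len(readings), "readings for the week")
```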
I'm going to present my findings to an external consultant soon, so I'm excited to learn his view on things as well.
So you're going to have 3 billion records by the end of this year (which has only just begun). Each record is 4 bytes for the ID + 4 bytes for the datetime + 8 bytes for the double value, which totals 3*10^9 * (4+4+8) bytes == 48 GB.
You can easily store and process those 48 GB in an in-memory database like Redis, Couchbase, Tarantool or Aerospike. All of them are open source, so you don't need to pay a license fee.
There might be some additional memory overhead of 10-30%, so the 48 GB can grow to 64 GB or slightly more. You should feed those databases with your real data to pick the most economical one for your case.
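As a quick sanity check, the same back-of-envelope estimate in a few lines of Python (the overhead factors are just the 10-30% range mentioned above):

```python
# Back-of-envelope sizing from the numbers above.
records = 3 * 10**9                     # expected rows by end of year
bytes_per_record = 4 + 4 + 8            # sensor ID + datetime + double value
raw_gb = records * bytes_per_record / 10**9
print(raw_gb)                           # 48.0 GB of raw data
print(raw_gb * 1.1, raw_gb * 1.3)       # ~52.8 to ~62.4 GB with 10-30% overhead
```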
A single physical machine should be enough for the whole workload, because in-memory databases can handle 100K-1M queries/updates per second per node (the real number depends on your specific workload pattern). For the sake of better availability, I would set up two servers - a master and a slave.
In my experience, the price of a physical server with 64 GB of RAM is $2-3K. Notice that you don't even need an SSD. A spinning disk should be fine, because all the reads hit RAM and all the writes only append to the transaction log. This is how in-memory databases work. I can elaborate on this if you have any questions.
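A minimal sketch of what this could look like with Redis (via redis-py), assuming one sorted set per sensor with the timestamp as the score so a date-range query becomes a ZRANGEBYSCORE; the key layout and member encoding are my assumptions, not a prescribed schema.

```python
# Hypothetical Redis layout: sorted set "sensor:<id>", score = epoch seconds.
from datetime import datetime, timezone
import redis

r = redis.Redis(host="localhost", port=6379)

def epoch(s):
    return datetime.fromisoformat(s).replace(tzinfo=timezone.utc).timestamp()

# Write one reading; the member encodes timestamp and value to stay unique.
t = epoch("2016-03-07T13:45:00")
r.zadd("sensor:42", {f"{t}:21.7": t})

# Read one week for sensor 42.
week = r.zrangebyscore("sensor:42",
                       epoch("2016-03-07T00:00:00"),
                       epoch("2016-03-14T00:00:00"))
for member in week:
    stamp, value = member.decode().split(":")
    print(stamp, value)
```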
So I have used all the technologies you listed in some manner or another. What kind of queries do you need to perform? Depending on that, you could rule out some of the solutions. If you don't need to query in a lot of different ways, Table Storage could work out well for you. It's going to scale really well if you follow the guidelines, and it's cheap. But if you can't just do a point query for the data you need, then it might not work so well, or it might be too complicated to be a good option.

OpenTSDB is great if you want a time series database, but it will limit you to time-series-type queries. There are a lot of time series DBs out there, and a lot of applications are built on top of them, such as Bosun and Grafana, to name two that I use.

For the last option, HDI, I would store the data in Parquet format (or some other columnar format), create a Hive table on top of the data and query it with Spark SQL. You don't really need to use Spark; you could use Hive as well. But what you should stay away from is traditional MapReduce - that paradigm is basically dead nowadays, and you should not write new code in it. On top of that, if you don't know it, there is a steep learning curve around it.

I use all of these technologies for different parts of our system, and it really depends on the read and write requirements of the application. I would look at using Spark and Parquet if I were you, but it's a lot of new tooling that might not be needed.
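A rough sketch of the Parquet + Spark SQL route, in case it helps: paths, column names and the date-based partitioning are assumptions for illustration, not a specific recommendation for your schema.

```python
# Hypothetical PySpark job: convert raw data to Parquet partitioned by date,
# then serve a timespan query with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-queries").getOrCreate()

# One-time (or incremental) conversion of the raw export to Parquet.
raw = spark.read.csv("wasb:///data/sensor_raw", header=True, inferSchema=True)
raw.write.partitionBy("date").parquet("wasb:///data/sensor_parquet", mode="append")

# Timespan query: only the partitions for the requested dates are read.
spark.read.parquet("wasb:///data/sensor_parquet").createOrReplaceTempView("sensor_data")
week = spark.sql("""
    SELECT sensor_id, date, value
    FROM sensor_data
    WHERE sensor_id = 42 AND date BETWEEN '2016-03-07' AND '2016-03-13'
""")
week.show()
```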
3 billion data points per year is a pretty low number for modern time series databases such as VictoriaMetrics. It can persist this amount of data in less than 3 minutes at an ingestion speed of 19 million samples per second on a computer with 64 vCPUs. See this article for details.
There are VictoriaMetrics production setups with up to 10 trillion data points per single node, and it scales to multiple nodes.
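For a feel of what the write/read path could look like against a single-node VictoriaMetrics instance, here is a hedged sketch using its InfluxDB line-protocol ingestion endpoint and the /api/v1/export endpoint; the host, metric name and label are assumptions I made for illustration.

```python
# Hypothetical: push one sample and export a week of raw samples back.
import requests

VM = "http://localhost:8428"  # assumed single-node VictoriaMetrics address

# Ingest via Influx line protocol: "measurement,tags field=value timestamp(ns)".
line = "sensor_value,sensor_id=42 value=21.7 1457358300000000000"
requests.post(f"{VM}/write", data=line)

# Export raw samples for one sensor over a week (one JSON line per series).
resp = requests.get(f"{VM}/api/v1/export", params={
    "match[]": 'sensor_value_value{sensor_id="42"}',
    "start": "2016-03-07T00:00:00Z",
    "end":   "2016-03-14T00:00:00Z",
})
for json_line in resp.text.splitlines():
    print(json_line)
```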