Situation: I've started a new job and been assigned the task of figuring out what to do with their sensor data table. It has 1.3 billion rows of sensor data. The data is pretty simple: basically just a sensor ID, a date and the sensor value at that point in time (double).
Currently, the data is stored in a table in an MS SQL Server database.
By the end of this year, I expect the number of rows to have increased to 2-3 billion.
I'm looking for a better way to store and query this data (by date), and since there are a lot of "big data" products out there, and I have no real experience managing such big data sets, I'm asking here for any pointers.
It's not a large company, and our resources are not unlimited ;)
Some more details about our use case:
My research so far has led me to consider the following solutions:
Keep the data in SQL Server
but partition the table (it's not partitioned right now). That will require the Enterprise edition of SQL Server, which costs a lot.
Move the data to Azure SQL Server.
There we'll get the partitioning feature for a lot less money, but once our DB grows above 250 GB it costs a lot more (and way too much above 500 GB).
Use several Databases
We could use one DB per customer. Several smaller DBs will be cheaper than one huge DB, but we've got a lot of customers and plans for more, so I don't really like the thought of managing all these databases.
Azure Storage Tables
This is the option I like best so far. We can partition the data by company/sensor/year/month, use the date as the row key and store the sensor value.
I haven't yet had time to test query performance, but from what I've read it should be good. There is one major downside, though: the limit of 1,000 items returned per HTTP request. If we need to fetch all sensor data for a week, we have to make a lot of HTTP requests. I'm not sure right now how big an issue this is for our use case.
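To make the idea concrete, here is a minimal sketch of what this layout could look like with the current azure-data-tables Python package. The table name, partition-key format and connection string are my own assumptions, not anything decided yet; the SDK transparently follows the 1,000-item continuation tokens while you iterate.

```python
# Hypothetical sketch: partition key = company_sensor_year_month,
# row key = ISO timestamp, so a week of one sensor's data is a range
# scan inside one (or two) partitions.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("sensordata")  # table name is an assumption

# Write one reading.
table.upsert_entity({
    "PartitionKey": "acme_sensor42_2016_03",
    "RowKey": "2016-03-07T13:45:00Z",
    "Value": 21.7,
})

# Read a week of data for that sensor; paging past the 1,000-item
# limit is handled by the returned iterator.
flt = ("PartitionKey eq 'acme_sensor42_2016_03' "
       "and RowKey ge '2016-03-07' and RowKey lt '2016-03-14'")
for entity in table.query_entities(flt):
    print(entity["RowKey"], entity["Value"])
```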
Azure HDInsight (Hadoop in Azure)
As mentioned, I have no experience with big data, and currently I don't understand Hadoop well enough to know whether it fits our case (exposing sensor data, for a given timespan, through an API). Should I dig deeper and learn, or is my time better spent pursuing another alternative?
Does anyone have experience with a similar case? What works for you? Keep in mind that price matters, and a "simple" solution may be preferred over a very complex one, even if the complex one is a few seconds faster.
UPDATE 1: To answer some of the questions in the comments below.
UPDATE 2: Today I experimented a bit with both Azure Table Storage and HDInsight (HDI). We don't require much query "flexibility", so I think Azure Table Storage looks really promising. It's a bit slow to pull out data due to the 1,000-item limit per request, as I mentioned, but in my tests it was fast enough for our use cases.
I've also stumbled upon OpenTSDB, which is what led me to try out HDI in the first place. Following a tutorial on Azure (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-tutorial-get-started/), I was able to store a million records and test some queries fairly quickly. It was a lot faster to query than Azure Table Storage; I could even pull down 300,000 records in one HTTP request (though it took 30 seconds).
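For reference, a timespan query against OpenTSDB's HTTP API looks roughly like the sketch below; the metric name "sensor.value" and the "sensor_id" tag are assumptions I made for illustration, not what the tutorial uses.

```python
# Hypothetical OpenTSDB query over HTTP for one sensor, one week.
import requests

query = {
    "start": "2016/03/07-00:00:00",
    "end":   "2016/03/14-00:00:00",
    "queries": [{
        "aggregator": "avg",          # only one tagged series, so effectively a no-op
        "metric": "sensor.value",
        "tags": {"sensor_id": "42"},
    }],
}
resp = requests.post("http://<opentsdb-host>:4242/api/query", json=query)
for series in resp.json():
    for ts, value in sorted(series["dps"].items()):
        print(ts, value)
```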
But it costs quite a bit more than Azure Table Storage, and I think I can optimize my code to improve query performance with Azure Table Storage (a more fine-grained partition key and running requests in parallel). So right now I'm leaning towards Azure Table Storage because of its simplicity, price and "good enough" performance.
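The "fine-grained partition key + parallel requests" idea would look something like this sketch, again assuming the azure-data-tables package and per-day partition keys (both assumptions on my part): split the week into one query per partition and run them concurrently.

```python
# Hypothetical: query several per-day partitions in parallel and merge results.
from concurrent.futures import ThreadPoolExecutor
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.get_table_client("sensordata")  # table name is an assumption

def fetch_partition(pk):
    # Each per-day partition holds one sensor's readings for one day.
    return list(table.query_entities(f"PartitionKey eq '{pk}'"))

days = [f"acme_sensor42_2016_03_{d:02d}" for d in range(7, 14)]  # assumed key format

with ThreadPoolExecutor(max_workers=8) as pool:
    results = pool.map(fetch_partition, days)
    readings = [entity for day in results for entity in day]

print(len(readings), "readings for the week")
```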
I'm going to present my findings to an external consultant soon, so I'm excited to learn his view on things as well.
So you're going to have 3 billion records by the end of this year (which has only just begun). Each record is 4 bytes for the ID + 4 bytes for the datetime + 8 bytes for the double value, which totals 3*10^9 * (4+4+8) bytes == 48 GB.
You can easily store and process those 48 GB in an in-memory database like Redis, Couchbase, Tarantool or Aerospike. All of them are open source, so you don't need to pay a license fee.
There might be some additional memory overhead of 10-30%, so the 48 GB can grow to 64 GB or slightly more. You should feed those databases with your real data to pick the most economical one for your case.
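As a quick sanity check, the same back-of-envelope estimate in a few lines of Python (the overhead factors are just the 10-30% range mentioned above):

```python
# Back-of-envelope sizing from the numbers above.
records = 3 * 10**9                     # expected rows by end of year
bytes_per_record = 4 + 4 + 8            # sensor ID + datetime + double value
raw_gb = records * bytes_per_record / 10**9
print(raw_gb)                           # 48.0 GB of raw data
print(raw_gb * 1.1, raw_gb * 1.3)       # ~52.8 to ~62.4 GB with 10-30% overhead
```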
A single physical machine should be enough for the whole workload, because in-memory databases can handle 100K-1M queries/updates per second per node (the real number depends on your specific workload pattern). For the sake of better availability, I would set up two servers - a master and a slave.
In my experience, the price of a physical server with 64 GB of RAM is $2-3K. Notice that you don't even need an SSD. A spinning disk should be fine, because all the reads hit RAM and all the writes only append to the transaction log. This is how in-memory databases work. I can elaborate on this if you have any questions.
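A minimal sketch of what this could look like with Redis (via redis-py), assuming one sorted set per sensor with the timestamp as the score so a date-range query becomes a ZRANGEBYSCORE; the key layout and member encoding are my assumptions, not a prescribed schema.

```python
# Hypothetical Redis layout: sorted set "sensor:<id>", score = epoch seconds.
from datetime import datetime, timezone
import redis

r = redis.Redis(host="localhost", port=6379)

def epoch(s):
    return datetime.fromisoformat(s).replace(tzinfo=timezone.utc).timestamp()

# Write one reading; the member encodes timestamp and value to stay unique.
t = epoch("2016-03-07T13:45:00")
r.zadd("sensor:42", {f"{t}:21.7": t})

# Read one week for sensor 42.
week = r.zrangebyscore("sensor:42",
                       epoch("2016-03-07T00:00:00"),
                       epoch("2016-03-14T00:00:00"))
for member in week:
    stamp, value = member.decode().split(":")
    print(stamp, value)
```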
So I have used all the technologies you listed in some manner or another. What kind of queries do you need to perform? Depending on that, you could rule out some of the solutions. If you don't need to query in a lot of different ways, Table Storage could work out well for you. It's going to scale really well if you follow the guidelines, and it's cheap. But if you can't just do a point query for the data you need, then it might not work so well, or it might be too complicated to be a good option.

OpenTSDB is great if you want a time series database, but it will limit you to time-series-type queries. There are a lot of time series DBs out there, and a lot of applications are built on top of them, such as Bosun and Grafana, to name two that I use.

For the last option, HDI, I would store the data in Parquet format (or some other columnar format), create a Hive table on top of the data and query it with Spark SQL. You don't really need to use Spark; you could use Hive as well. But what you should stay away from is traditional MapReduce - that paradigm is basically dead nowadays, and you should not write new code in it. On top of that, if you don't know it, there is a steep learning curve around it.

I use all of these technologies for different parts of our system, and it really depends on the read and write requirements of the application. I would look at using Spark and Parquet if I were you, but it's a lot of new tooling that might not be needed.
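A rough sketch of the Parquet + Spark SQL route, in case it helps: paths, column names and the date-based partitioning are assumptions for illustration, not a specific recommendation for your schema.

```python
# Hypothetical PySpark job: convert raw data to Parquet partitioned by date,
# then serve a timespan query with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-queries").getOrCreate()

# One-time (or incremental) conversion of the raw export to Parquet.
raw = spark.read.csv("wasb:///data/sensor_raw", header=True, inferSchema=True)
raw.write.partitionBy("date").parquet("wasb:///data/sensor_parquet", mode="append")

# Timespan query: only the partitions for the requested dates are read.
spark.read.parquet("wasb:///data/sensor_parquet").createOrReplaceTempView("sensor_data")
week = spark.sql("""
    SELECT sensor_id, date, value
    FROM sensor_data
    WHERE sensor_id = 42 AND date BETWEEN '2016-03-07' AND '2016-03-13'
""")
week.show()
```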
3 billion data points per year is a pretty low number for modern time series databases such as VictoriaMetrics. It can persist this amount of data in less than 3 minutes at an ingestion speed of 19 million samples per second on a computer with 64 vCPUs. See this article for details.
There are VictoriaMetrics production setups with up to 10 trillion data points per single node, and it scales to multiple nodes.
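For a feel of what the write/read path could look like against a single-node VictoriaMetrics instance, here is a hedged sketch using its InfluxDB line-protocol ingestion endpoint and the /api/v1/export endpoint; the host, metric name and label are assumptions I made for illustration.

```python
# Hypothetical: push one sample and export a week of raw samples back.
import requests

VM = "http://localhost:8428"  # assumed single-node VictoriaMetrics address

# Ingest via Influx line protocol: "measurement,tags field=value timestamp(ns)".
line = "sensor_value,sensor_id=42 value=21.7 1457358300000000000"
requests.post(f"{VM}/write", data=line)

# Export raw samples for one sensor over a week (one JSON line per series).
resp = requests.get(f"{VM}/api/v1/export", params={
    "match[]": 'sensor_value_value{sensor_id="42"}',
    "start": "2016-03-07T00:00:00Z",
    "end":   "2016-03-14T00:00:00Z",
})
for json_line in resp.text.splitlines():
    print(json_line)
```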