I am facing a problem: designing a database for process plants. There are up to 50,000 sensors, each sampled every 50 ms. All measured values must be stored for at least 3 years, and the database must support real-time queries (i.e. users can view historical data with a delay of less than 1 second). I recently read an article about time-series databases, and there are many options: OpenTSDB, KairosDB, InfluxDB, ...
I am not sure which one would be suitable for this purpose. If anyone knows about this, please help me!
UPDATE 15.06.25
Today I ran a test based on OpenTSDB. I used VirtualBox to create a cluster of 3 CentOS x64 VMs (1 master, 2 slaves). The host has 8 GB RAM and a Core i5 CPU. The master VM is configured with 3 GB RAM, and the slave VMs with 1.5 GB RAM. I wrote a Python program to send data to OpenTSDB, as below:
import socket
import time

# Open a plain TCP connection to the OpenTSDB telnet-style port
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("192.168.10.55", 4242))
start_time = time.time()
start_epoch = 1434192418
for x in range(0, 1000000):
    curr_epoch = start_epoch + x
    # One "put" line per tag (TAG_1 .. TAG_10), all sharing the same timestamp
    lines = ["put TAG_%d %d 12.9 stt=good\n" % (i, curr_epoch) for i in range(1, 11)]
    # sendall() keeps writing until the whole batch is sent; send() may stop early
    s.sendall("".join(lines).encode("ascii"))
s.close()
print("--- %s seconds ---" % (time.time() - start_time))
I ran the Python program on the host, and it completed after ~220 seconds. So I got an average speed of ~45,000 records per second.
UPDATE 15.06.29
This time I used only 1 VM (5 GB RAM, 3 cores, CentOS x64, pseudo-distributed Hadoop). I ran 2 Python processes on the Windows 7 host, each sending one half of the data to OpenTSDB. The average write speed was ~100,000 records per second.
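To put both test results next to the requirement, here is a minimal sketch of the arithmetic. The function names are only illustrative, and the numbers are the ones quoted above (10 tags x 1,000,000 timestamps in ~220 seconds for the first test, ~100,000 records per second for the second).

```python
def required_rate(sensors, interval_ms):
    """Records per second needed when every sensor reports once per interval."""
    return sensors * 1000 // interval_ms


def measured_rate(records, seconds):
    """Average records per second achieved in a test run."""
    return records / seconds


# Requirement from the question: 50,000 sensors sampled every 50 ms
target = required_rate(50_000, 50)

# First test: 10 tags x 1,000,000 timestamps, finished in about 220 seconds
test1 = measured_rate(10 * 1_000_000, 220)

# Second test: average rate as reported, not recomputed
test2 = 100_000

for name, rate in (("3-VM cluster", test1), ("single VM", test2)):
    print("%s: %.0f records/s, %.1f%% of the %d records/s target"
          % (name, rate, 100.0 * rate / target, target))
```

Both tests ran on small VMs on a desktop host, so they say little about what real hardware would reach. They do show that the target write rate is about 10 times the best figure measured here.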
InfluxDB won't handle a sustained million writes per second right now, but that's within the performance target for later this year. The bigger challenge I see is the sheer volume of data you want to store. If you need to keep three years' worth at full resolution with no downsampling, that's hundreds of terabytes of data. If that's not all on SSD, the query performance will not be good. If that is all on SSD, that's an extremely expensive database. Also, with that much raw data, it would be very easy to craft a query that explodes the RAM no matter how much is installed.
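As a rough sanity check on that volume, the estimate can be sketched as below. The bytes-per-point figures are assumptions for illustration only, not measured numbers for InfluxDB or any other database: 16 bytes stands for an uncompressed 8-byte timestamp plus an 8-byte value, and 3 bytes is a guess at what compression might achieve.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def total_points(sensors, interval_ms, years):
    """Number of samples collected over the retention period."""
    per_second = sensors * 1000 // interval_ms
    return per_second * SECONDS_PER_YEAR * years


def storage_tb(points, bytes_per_point):
    """Storage in decimal terabytes for a given per-point footprint."""
    return points * bytes_per_point / 1e12


# 50,000 sensors at 50 ms for 3 years, as stated in the question
points = total_points(50_000, 50, 3)
print("points stored: %d" % points)

# Both per-point sizes are assumptions, see the note above
for label, bpp in (("uncompressed, assumed 16 B/point", 16),
                   ("compressed, assumed 3 B/point", 3)):
    print("%s: about %.0f TB" % (label, storage_tb(points, bpp)))
```

Under these assumptions the raw data lands somewhere between a few hundred terabytes and more than a petabyte, before any replication or indexing overhead.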
I would say check back with the InfluxDB team in 8-12 weeks and we might have a better idea how to handle your problem. My advice, though, is to find a way to split that up. If you really are sampling 50k machines at 50ms intervals, it's a huge amount of data, network traffic, etc.