
Time series storage in MongoDB

I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day; I don't do any cross-sensor queries. The time series are unevenly spaced and I need to keep the full time resolution, so I cannot do things like arrays of 1 point per second.

I plan to store data over many years. I wonder which scheme is the best:

  1. each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db
  2. each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.

Option 1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a DB.

Option 2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually slow down as each collection grows.

I am favoring option 1, but I might be wrong. Any advice?

Update:

I followed the advice of

https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/

Instead of storing one document per measurement, I now have bucket documents each containing 128 measurements plus startDate and nextDate fields. This reduces the number of documents and thus the index size, but I am still not sure how to organize the collections.
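The bucketing described above could be sketched as follows. This is a minimal illustration, not the linked article's actual code; the field names (`startDate`, `nextDate`, `samples`) and the assumption that `nextDate` holds the first timestamp of the following bucket are taken from the description above:

```python
def bucket_measurements(points, bucket_size=128):
    """Group (timestamp, value) pairs into fixed-size bucket documents.

    Each bucket stores up to `bucket_size` samples plus the time range
    it covers, so one document replaces up to 128 measurement documents
    and the index only has to cover bucket boundaries.
    """
    buckets = []
    for i in range(0, len(points), bucket_size):
        chunk = points[i:i + bucket_size]
        # First timestamp of the next bucket, or None for the last bucket.
        nxt = points[i + bucket_size][0] if i + bucket_size < len(points) else None
        buckets.append({
            "startDate": chunk[0][0],
            "nextDate": nxt,
            "samples": chunk,
        })
    return buckets
```

With 100,000 points per sensor per day this yields roughly 780 bucket documents instead of 100,000 single-measurement documents.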

When I query data, I only ever want the data for a (date, sensor) pair, which is why I thought option 1 might speed up reads. I currently have about 20,000 collections in my DB, and just listing all the collections already takes ages, which makes me think that having so many collections is not a good idea.

What do you think?

asked Mar 14 '17 by Fred

3 Answers

I would definitely recommend approach 2, for a number of reasons:

  1. MongoDB's sharding is designed to cope with individual collections getting larger and larger, and handles splitting data within a collection across separate servers as required. It does not have the same ability to split data that exists in many collections across different servers.
  2. MongoDB is designed to query very large collections efficiently, even when the data is split across multiple servers, as long as you can pick a suitable shard key that matches your most common read queries. In your case, that would be sensor + date.
  3. With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you.
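With approach 2 and the sensor + date shard key suggested in point 2, a day's worth of data for one sensor is a single range query. A sketch of the filter, assuming hypothetical field names `sensor`, `startDate`, and `nextDate` on bucket documents (a bucket overlaps the day if it starts before the day ends and ends after the day starts):

```python
from datetime import datetime, timedelta

def day_query(sensor_id, day):
    """Build a MongoDB-style filter matching every bucket of one sensor
    that overlaps one calendar day.

    Assumes a compound index (and shard key) on (sensor, startDate).
    """
    start = datetime(day.year, day.month, day.day)
    end = start + timedelta(days=1)
    return {
        "sensor": sensor_id,
        "startDate": {"$lt": end},    # bucket starts before the day ends
        "nextDate": {"$gte": start},  # bucket ends on or after day start
    }
```

Such a filter would be passed to `collection.find(...)`, and mongos would route it to only the shard(s) holding that sensor/date range.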
answered Oct 06 '22 by Vince Bowdren


Whilst MongoDB has no limit on collections, I tried an approach similar to 2 but moved away from it to a single collection for all sensor values because it was more manageable.

Your planned data volume is significant. Have you considered ways to reduce it? In my system I compress same-value runs and only store changes; I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs. one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.

If you can guarantee unique timestamps, you may also be able to use the timestamp as the _id field.

Sensor data

answered Oct 06 '22 by Ian Mercer


When I query the data I'm only interested in getting data from a given sensor on a given day. I don't do any cross-sensor queries.

But that's exactly what Cassandra is good for! See this article and this one.

Really, in one of my projects we were stuck with legacy MongoDB in a scenario similar to yours, except that the amount of new data per day was even lower. We tried changing the data structure, spreading the data over multiple MongoDB collections, changing replica set configurations, etc. But we were still disappointed: as the data grew, performance degraded under unpredictable load, and read requests noticeably affected write response times.
With Cassandra we had fast writes, and the improvement in read performance was visible to the naked eye. If you need complex data analysis and aggregation, you can always run a Spark (map-reduce) job. Moreover, thinking about the future, Cassandra provides straightforward scalability.

I believe that keeping a legacy technology is fine as long as it fits well, but if it doesn't, it's more effective to change the technology stack.

answered Oct 06 '22 by S. Stas