I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day. I don t do any cross sensor queries. The timeseries are unevenly spaced and I need to keep the time resolution so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
1 seems to intuitively be faster for querying. I am using mongoDb 3.4 which has no limit for the number of collections in a db.
2 seems cleaner but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows
I am favoring 1 but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I have a document containing 128 measurement,startDate,nextDate. It reduces the number of documents and thus the index size but I am still not sure how to organize the collections.
When I query data, I just want the data for a (date,sensor) pair, that is why I thought 1 might speed up the reads. I currently have about 20,000 collections in my DB and when I query the list of all collections, it takes ages which makes me think that it is not a good idea to have so many collections.
What do you think?
I would definitely recommend approach 2, for a number of reasons:
Whilst MongoDB has no limit on collections I tried a similar approach to 2 but moved away from it to a single collection for all sensor values because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes, I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Various different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
When I query the data I m only interested in getting data from a given sensor on a given day. I don t do any cross sensor queries.
But that's what exactly what Cassandra is good for! See this article and this one.
Really, in one of our my projects we were stuck with legacy MongoDB and the scenario, similar to yours, with the except of new data amount per day was even lower.
We tried to change data structure, granulate data over multiple MongoDB collections, changed replica set configurations, etc.
But we were still disappointed as data increases, but performance degrades
with the unpredictable load and reading data request affects writing response much.
With Cassandra we had fast writes and data retrieving performance effect was visible with the naked eye. If you need complex data analysis and aggregation, you could always use Spark (Map-reduce) job.
Moreover, thinking about future, Cassandra provides straightforward scalability.
I believe that keeping something for legacy is good as long as it suits well, but if not, it's more effective to change the technology stack.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With