I am looking to use MongoDB to store time-series data. For the sake of discussion, imagine I have a finite number of sensors deployed (e.g. 10, 100, or 1000 sensors). Each sensor has a dozen "metrics" (e.g. temp, humidity, etc.) which are collected every minute and then stored.
There is a front end which then displays charts for each sensor, or aggregates over selected intervals.
What is the best approach, performance wise, to store this? Specifically:
- Performance-wise, does it matter if I use a single database or more? I could create one db for each sensor or just use a single huge db for everything.
- Performance-wise, does it matter if I partition the data by sensor or by metric?
- Performance-wise, should I make a collection just for the sensor info and separate collections for the data, or merge the two into the same collection?
Thanks a lot
Approach 1(A): Creating a single database for everything (with a single collection)
Pros:
- Less maintenance: backup, creating database users, restore, etc.
Cons:
- You may see database-level locking while building indexes on a large collection
- To perform operations on a specific sensor's data, you need additional indexes so queries can fetch only that sensor's documents
- You cannot create more than 64 indexes on a single collection, although needing that many would indicate a poor indexing strategy anyway.
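As a sketch of what the single-collection layout might look like (all field names here are hypothetical), each reading could be one document with the metrics embedded, plus a compound index on sensor and timestamp so sensor-specific range queries stay selective:

```python
from datetime import datetime, timezone

# Hypothetical document shape for one big "readings" collection:
# one document per sensor per minute, all metrics embedded together.
reading = {
    "sensor_id": "sensor-0042",                   # which sensor produced this
    "ts": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    "metrics": {"temp": 21.4, "humidity": 0.55},  # a dozen metrics in practice
}

# With pymongo, the extra index mentioned above would look like:
#   db.readings.create_index([("sensor_id", 1), ("ts", -1)])
# so queries filtered on sensor_id and a time range avoid a collection scan.
```

This is the index that the con above refers to: without it, every per-sensor chart query scans the whole monolithic collection.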
Approach 1(B): Creating a single database for everything (with one collection per sensor)
Pros:
- Less maintenance: backup, creating database users, restore, etc.
- Minimizes the need for indexes to identify sensor-specific data within one monolithic collection
- Every sensor-specific query targets only one collection, so it does not pull a large working set into memory the way a single large collection would.
- Building an index on a relatively small collection is more feasible than building one on a single large collection
Cons:
- You may end up creating too many indexes (the sum of the indexes across all collections).
- More maintenance is required for a large number of indexes.
- WiredTiger internally creates one file per collection and one per index. If your use case grows to a large number of sensors, you may hit the 64K open-file limit.
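A quick back-of-the-envelope calculation shows how the index and file counts in approach 1(B) scale with the number of sensors (the figures are illustrative assumptions, not measurements):

```python
# Illustrative assumptions: each per-sensor collection carries the
# mandatory _id index plus two secondary indexes.
sensors = 1000
indexes_per_collection = 3  # _id + two secondaries (assumed)

# Total indexes across the whole database:
total_indexes = sensors * indexes_per_collection

# WiredTiger keeps one file per collection and one per index,
# so the file count grows even faster:
files_on_disk = sensors * (1 + indexes_per_collection)

print(total_indexes, files_on_disk)  # 3000 4000
```

At 10,000+ sensors with the same assumptions, the file count alone approaches the 64K limit mentioned above, which is why this layout stops scaling.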
Performance-wise, does it matter if I partition the data by sensor or by metric?
- This depends on the access patterns expected from your analytics app.
Performance-wise, should I make a collection just for the sensor info and separate collections for the data, or merge the two into the same collection?
Creating separate collections for sensor metadata and sensor data is worthwhile: it avoids duplicating the sensor metadata in every collected data document.
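A minimal sketch of that split (field and collection names are assumptions for illustration): the static description lives once in a `sensors` collection, and each time-series document carries only a reference to it:

```python
# Static metadata, stored once per sensor in a "sensors" collection.
sensor_meta = {
    "_id": "sensor-0042",      # natural key doubles as the _id
    "location": "warehouse-3",
    "model": "TH-100",
}

# Time-series documents in a "readings" collection reference the
# metadata by _id instead of embedding location/model in every row.
data_doc = {
    "sensor_id": sensor_meta["_id"],
    "ts": "2024-01-01T12:00:00Z",
    "metrics": {"temp": 21.4, "humidity": 0.55},
}
```

The front end can fetch the metadata once per sensor and join it to the readings client-side (or with `$lookup`), rather than paying for the duplicated fields in every one of the per-minute documents.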
You may like to read William's blog post on designing this pattern.
As always, it's better to design a sample schema and test your queries within your test environment.